[Oct 11, 2022] Professional-Data-Engineer Free Exam Questions with Quality Guaranteed
Professional-Data-Engineer Free Exam Files Downloaded Instantly
Introduction
Data engineers are responsible for finding trends in data sets and developing algorithms to help make raw data more useful to the enterprise. This IT role requires a significant set of technical skills, including a deep knowledge of SQL database design and multiple programming languages. Data engineers collect, transform, and visualize data. The Data Engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault-tolerance, scalability, fidelity, and efficiency of such systems.
Target Audience
Candidates for this certification are data engineers or those aiming to become one. They should be able to enable data-driven decision-making by collecting, transforming, and publishing data. They have expertise in designing, building, operationalizing, and monitoring secure data processing systems, with specific emphasis on compliance and security, fidelity and reliability, portability and flexibility, and efficiency and scalability.
NEW QUESTION 157
Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?
- A. Hashing
- B. Salting
- C. Field promotion
- D. Randomization
Answer: C
Explanation:
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
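As a rough illustration of field promotion, the sketch below moves identifying fields in front of the timestamp so that new writes spread across the key space instead of all landing next to the latest timestamp. The field names (device_id, metric) and the "#" delimiter are assumptions made for this example, not part of the question.

```python
def timestamp_first_key(device_id: str, metric: str, epoch_seconds: int) -> str:
    """Hotspot-prone layout: every new write shares the latest timestamp prefix."""
    return f"{epoch_seconds}#{device_id}#{metric}"


def promoted_key(device_id: str, metric: str, epoch_seconds: int) -> str:
    """Field promotion: identifying fields lead the key, the timestamp comes last."""
    return f"{device_id}#{metric}#{epoch_seconds}"


# Hypothetical events arriving at almost the same moment.
events = [
    ("dev-a", "temp", 1665446400),
    ("dev-b", "temp", 1665446400),
    ("dev-c", "temp", 1665446401),
]

naive_keys = sorted(timestamp_first_key(*e) for e in events)
promoted_keys = sorted(promoted_key(*e) for e in events)

# Distinct leading fields show how widely the writes are spread.
naive_prefixes = {k.split("#")[0][:8] for k in naive_keys}
promoted_prefixes = {k.split("#")[0] for k in promoted_keys}
```

With the timestamp first, all three keys share one leading prefix; with the promoted layout, each device starts its own key range, and a per-device time range scan stays a simple prefix scan.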
NEW QUESTION 158
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
- A. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
- B. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- C. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
- D. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
- E. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
Answer: C
Explanation:
Because the data type change is permanent, it is better to write the converted data to a new table and retire the old one. A view is not suitable because the cast would be re-executed every time the view is queried, incurring additional charges.
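The one-time conversion described above can be sketched outside BigQuery as follows. This is only an illustration of casting epoch strings once and then working with real timestamps; it assumes DT holds epoch seconds as strings, and the sample values are made up.

```python
from datetime import datetime, timezone


def epoch_string_to_timestamp(dt: str) -> datetime:
    """Cast an epoch-seconds string (as stored in DT) to a UTC timestamp."""
    return datetime.fromtimestamp(int(dt), tz=timezone.utc)


# Hypothetical DT values for one user's clicks, stored as strings.
clicks = ["1665446400", "1665446460", "1665447000"]

# Convert once (the analogue of writing NEW_CLICK_STREAM), then reuse.
timestamps = [epoch_string_to_timestamp(c) for c in clicks]

# Session duration is then plain timestamp arithmetic.
session_seconds = (max(timestamps) - min(timestamps)).total_seconds()
```

Doing the cast once up front is what keeps later queries cheap; a view would repeat the conversion on every read.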
NEW QUESTION 159
Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error: Expected end of statement but got "-" at [4:11]
SELECT age
FROM
  bigquery-public-data.noaa_gsod.gsod
WHERE
  age != 99
  AND _TABLE_SUFFIX = '1929'
ORDER BY
  age DESC
Which table name will make the SQL statement work correctly?
- A. `bigquery-public-data.noaa_gsod.gsod*`
- B. `bigquery-public-data.noaa_gsod.gsod`
- C. bigquery-public-data.noaa_gsod.gsod*
- D. `bigquery-public-data.noaa_gsod.gsod'*
Answer: A
Explanation:
It follows the correct wildcard syntax of enclosing the table name in backticks and including the * wildcard character.
NEW QUESTION 160
You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the "Trust No One" (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data. What should you do?
- A. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
- B. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
- C. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket. Manually destroy the key previously used for encryption, and rotate the key once.
- D. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.
Answer: C
NEW QUESTION 161
You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date_released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?
- A. Manually configure the index in your index config as follows:

- B. Set the following in your entity options: exclude_from_indexes = 'actors, tags'
- C. Manually configure the index in your index config as follows:

- D. Set the following in your entity options: exclude_from_indexes = 'date_published'
Answer: A
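The index configurations for options A and C are not reproduced in this text, so the sketch below only illustrates why the explosion happens. It assumes that one composite index covering both multi-valued properties stores an entry for every combination of their values, while separate indexes per multi-valued property store one entry per value. The value counts are hypothetical.

```python
def combined_index_entries(num_actors: int, num_tags: int) -> int:
    """One index over (actors, tags, date_released): an entry per actor/tag pair."""
    return num_actors * num_tags


def separate_index_entries(num_actors: int, num_tags: int) -> int:
    """Two indexes, (actors, date_released) and (tags, date_released)."""
    return num_actors + num_tags


# Hypothetical movie with 20 actors and 10 tags.
actors, tags = 20, 10
combined = combined_index_entries(actors, tags)   # grows multiplicatively
separate = separate_index_entries(actors, tags)   # grows additively
```

Since the typical queries filter on either actors or tags, never both at once, each query can be served by an index that pairs just one multi-valued property with date_released.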
NEW QUESTION 162
You are building an application to share financial market data with consumers, who will receive data feeds.
Data is collected from the markets in real time. Consumers will receive the data in the following ways:
* Real-time event stream
* ANSI SQL access to real-time stream and historical data
* Batch historical exports
Which solution should you use?
- A. Cloud Pub/Sub, Cloud Storage, BigQuery
- B. Cloud Pub/Sub, Cloud Dataproc, Cloud SQL
- C. Cloud Dataflow, Cloud SQL, Cloud Spanner
- D. Cloud Dataproc, Cloud Dataflow, BigQuery
Answer: C
NEW QUESTION 163
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.
Which two actions should you take? (Choose two.)
- A. Ensure all the tables are included in a global dataset.
- B. Ensure each table is included in a dataset for a region.
- C. Adjust the settings for each table to allow a related region-based security group view access.
- D. Adjust the settings for each view to allow a related region-based security group view access.
- E. Adjust the settings for each dataset to allow a related region-based security group view access.
Answer: B,D
NEW QUESTION 164
You're training a model to predict housing prices based on an available dataset with real estate properties.
Your plan is to train a fully connected neural net, and you've discovered that the dataset contains latitude and longitude of the property. Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer a feature that incorporates this physical dependency.
What should you do?
- A. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization.
- B. Create a feature cross of latitude and longitude, bucketize at the minute level and use L1 regularization during optimization.
- C. Create a numeric column from a feature cross of latitude and longitude.
- D. Provide latitude and longitude as input vectors to your neural net.
Answer: B
Explanation:
Use L1 regularization when some features are much more influential than others: it shrinks the weights of less important features to 0.
L2 regularization performs better when all input features influence the output and the weights are of roughly equal size.
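A minimal sketch of the bucketize-and-cross step from the chosen answer is shown below. It assumes "minute level" means buckets of 1/60 of a degree and encodes each latitude/longitude cell as a single integer id; the function names and the encoding are illustrative choices, not taken from the question.

```python
import math

MINUTES_PER_DEGREE = 60
LAT_BUCKETS = 180 * MINUTES_PER_DEGREE  # latitude spans [-90, 90]
LON_BUCKETS = 360 * MINUTES_PER_DEGREE  # longitude spans [-180, 180]


def minute_bucket(value: float, lower_bound: float, n_buckets: int) -> int:
    """Index of the one-minute-wide bucket containing value, clipped to range."""
    idx = math.floor((value - lower_bound) * MINUTES_PER_DEGREE)
    return min(max(idx, 0), n_buckets - 1)


def crossed_cell(lat: float, lon: float) -> int:
    """Feature cross: one id per (latitude bucket, longitude bucket) pair."""
    lat_b = minute_bucket(lat, -90.0, LAT_BUCKETS)
    lon_b = minute_bucket(lon, -180.0, LON_BUCKETS)
    return lat_b * LON_BUCKETS + lon_b


# Two nearby properties fall into the same cell; a distant one does not.
cell_a = crossed_cell(37.7749, -122.4194)
cell_b = crossed_cell(37.7750, -122.4195)
cell_c = crossed_cell(40.7128, -74.0060)
```

The cross produces a very large, sparse categorical feature (one weight per cell), which is the situation where L1 regularization is useful: weights for cells that carry no signal are driven to 0.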
NEW QUESTION 165
You are building new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?
- A. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
- B. Include ORDER BY DESC on timestamp column and LIMIT to 1.
- C. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
- D. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
Answer: A
Explanation:
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
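The effect of the ROW_NUMBER approach can be mimicked in plain Python as a sanity check. The sketch below partitions rows by their unique ID and keeps only the first row of each partition. Ordering each partition by newest event timestamp first is an assumption (the answer does not state an ordering), and the column names are hypothetical.

```python
def latest_per_id(rows):
    """Keep, per unique_id, the row that would get row number 1.

    Mirrors ROW_NUMBER() over a partition by unique_id, with rows in each
    partition ordered by event_ts descending (an assumed ordering).
    """
    ranked = sorted(rows, key=lambda r: (r["unique_id"], -r["event_ts"]))
    seen = set()
    result = []
    for row in ranked:
        if row["unique_id"] not in seen:  # first row in its partition
            seen.add(row["unique_id"])
            result.append(row)
    return result


# Hypothetical streamed rows; "a" was delivered more than once.
rows = [
    {"unique_id": "a", "event_ts": 100, "value": 10},
    {"unique_id": "b", "event_ts": 105, "value": 7},
    {"unique_id": "a", "event_ts": 100, "value": 10},
    {"unique_id": "a", "event_ts": 130, "value": 11},
]

deduped = latest_per_id(rows)
```

Each unique ID appears exactly once in the result, which is what makes this pattern safe for interactive queries over data that may have been streamed in more than once.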
NEW QUESTION 166
Your neural network model is taking days to train. You want to increase the training speed. What can you do?
- A. Subsample your training dataset.
- B. Increase the number of input features to your model.
- C. Subsample your test dataset.
- D. Increase the number of layers in your neural network.
Answer: A
NEW QUESTION 167
When a Cloud Bigtable node fails, ____ is lost.
- A. the time dimension
- B. the last transaction
- C. no data
- D. all data
Answer: C
Explanation:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
* Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
* Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
* When a Cloud Bigtable node fails, no data is lost.
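The separation of nodes from storage can be pictured with a toy model. In the sketch below, a dictionary stands in for Colossus and each node holds only a list of tablet names (pointers). The round-robin reassignment is an illustrative assumption, not Bigtable's actual balancing logic.

```python
def fail_over(node_tablets, failed_node):
    """Return new assignments with the failed node's pointers given to survivors."""
    survivors = sorted(n for n in node_tablets if n != failed_node)
    if not survivors:
        raise ValueError("no surviving node to take over tablets")
    new_assignments = {n: list(node_tablets[n]) for n in survivors}
    orphaned = node_tablets.get(failed_node, [])
    for i, tablet in enumerate(orphaned):
        # Only the pointer moves; the tablet's data is never touched.
        new_assignments[survivors[i % len(survivors)]].append(tablet)
    return new_assignments


# Toy stand-in for Colossus: tablet name -> rows stored durably elsewhere.
storage = {
    "tablet-1": ["row-a", "row-b"],
    "tablet-2": ["row-c"],
    "tablet-3": ["row-d", "row-e"],
}
storage_before = {name: list(rows) for name, rows in storage.items()}

assignments = {"node-1": ["tablet-1"], "node-2": ["tablet-2", "tablet-3"]}
after_failure = fail_over(assignments, "node-2")
```

After the failure every tablet is still reachable through a surviving node, and the stored rows are unchanged because only metadata moved.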
NEW QUESTION 168
When a Cloud Bigtable node fails, ____ is lost.
- A. the time dimension
- B. the last transaction
- C. no data
- D. all data
Answer: C
Explanation:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
* Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
* Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
* When a Cloud Bigtable node fails, no data is lost.
Reference: https://cloud.google.com/bigtable/docs/overview
NEW QUESTION 169
You receive data files in CSV format monthly from a third party. You need to cleanse this data, but every third month the schema of the files changes. Your requirements for implementing these transformations include:
* Executing the transformations on a schedule
* Enabling non-developer analysts to modify transformations
* Providing a graphical tool for designing transformations
What should you do?
- A. Load each month's CSV data into BigQuery, and write a SQL query to transform the data to a standard schema. Merge the transformed tables together with a SQL query
- B. Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis
- C. Help the analysts write a Cloud Dataflow pipeline in Python to perform the transformation. The Python code should be stored in a revision control system and modified as the incoming data's schema changes
- D. Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file before creating a Dataframe. Then implement the transformations in Spark SQL before writing the data out to Cloud Storage and loading into BigQuery
Answer: B
NEW QUESTION 170
You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?
- A. Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
- B. Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
- C. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
- D. Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage
Answer: A
NEW QUESTION 171
You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems. Which solution should you choose?
- A. Cloud Speech-to-Text API
- B. Cloud AutoML Natural Language
- C. Cloud Natural Language API
- D. Dialogflow Enterprise Edition
Answer: D
Explanation:
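Dialogflow is a platform for building conversational interfaces: it interprets the intent behind a customer's voice or text request and can pass that request to backend systems for fulfillment.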
NEW QUESTION 172
You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?
- A. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
- B. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
- C. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.
- D. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
Answer: B
NEW QUESTION 173
You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?
- A. Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
- B. Rewrite your models on TensorFlow, and start using Cloud ML Engine
- C. Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery
- D. Use Cloud ML Engine for training existing Spark ML models
Answer: D
NEW QUESTION 174
You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?
- A. Write a Cloud Dataflow pipeline that aggregates all data in session windows.
- B. Deploy small Kafka clusters in your data centers to buffer events.
- C. Establish a Cloud Interconnect between all remote data centers and Google.
- D. Have the data acquisition devices publish data to Cloud Pub/Sub.
Answer: D
Explanation:
Cloud Pub/Sub is a global service with high message delivery capacity.
NEW QUESTION 175
Your neural network model is taking days to train. You want to increase the training speed. What can you do?
- A. Increase the number of layers in your neural network.
- B. Increase the number of input features to your model.
- C. Subsample your test dataset.
- D. Subsample your training dataset.
Answer: D
Explanation:
Subsampling the training dataset reduces the amount of data processed in each training pass, which shortens training time.
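A minimal sketch of subsampling a training set is shown below, assuming uniform random sampling without replacement. The function name, the fraction, and the fixed seed are illustrative choices. The test set is left untouched so that evaluation stays comparable.

```python
import random


def subsample(examples, fraction, seed=0):
    """Return a uniform random subset containing `fraction` of the examples."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(examples) * fraction))
    # A dedicated Random instance keeps the sample reproducible.
    return random.Random(seed).sample(examples, k)


# Hypothetical training set of 1,000 example ids.
training_set = list(range(1000))
smaller_training_set = subsample(training_set, fraction=0.25)
```

Training on the smaller set does proportionally less work per epoch, usually at some cost in model quality, so the fraction is a trade-off to tune.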
NEW QUESTION 176
Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)
- A. Supervised learning to determine which transactions are most likely to be fraudulent.
- B. Clustering to divide the transactions into N categories based on feature similarity.
- C. Unsupervised learning to predict the location of a transaction.
- D. Reinforcement learning to predict the location of a transaction.
- E. Supervised learning to predict the location of a transaction.
- F. Unsupervised learning to determine which transactions are most likely to be fraudulent.
Answer: B,D,F
NEW QUESTION 177
Your company needs to upload their historic data to Cloud Storage. The security rules don't allow access from external IPs to their on-premises resources. After an initial upload, they will add new data from existing on-premises applications every day. What should they do?
- A. Use Cloud Dataflow and write the data to Cloud Storage.
- B. Execute gsutil rsync from the on-premises servers.
- C. Write a job template in Cloud Dataproc to perform the data transfer.
- D. Install an FTP server on a Compute Engine VM to receive the files and move them to Cloud Storage.
Answer: A
NEW QUESTION 178
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
- A. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
- B. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- C. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
- D. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
- E. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
Answer: D
Topic 1, Flowlogistic Case Study
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
* Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
* Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
* Databases
  * 8 physical servers in 2 clusters
    * SQL Server - user data, inventory, static data
  * 3 physical servers
    * Cassandra - metadata, tracking messages
  * 10 Kafka servers - tracking message aggregation and batch insert
* Application servers - customer front end, middleware for order/customs
  * 60 virtual machines across 20 physical servers
    * Tomcat - Java services
    * Nginx - static content
    * Batch servers
* Storage appliances
  * iSCSI for virtual machine (VM) hosts
  * Fibre Channel storage area network (FC SAN) - SQL server storage
  * Network-attached storage (NAS) - image storage, logs, backups
* Apache Hadoop/Spark servers
  * Core Data Lake
  * Data analysis workloads
* 20 miscellaneous servers
  * Jenkins, monitoring, bastion hosts
Business Requirements
* Build a reliable and reproducible environment with scaled parity of production.
* Aggregate data in a centralized Data Lake for analysis
* Use historical data to perform predictive analytics on future shipments
* Accurately track every shipment worldwide using proprietary technology
* Improve business agility and speed of innovation through rapid provisioning of new resources
* Analyze and optimize architecture for performance in the cloud
* Migrate fully to the cloud if all other requirements are met
Technical Requirements
* Handle both streaming and batch data
* Migrate existing Hadoop workloads
* Ensure architecture is scalable and elastic to meet the changing demands of the company.
* Use managed services whenever possible
* Encrypt data in flight and at rest
* Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around. We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
NEW QUESTION 179
......