
Exploring and preprocessing organization-wide data (e.g., Cloud Storage, BigQuery, Spanner, Cloud SQL, Apache Spark, Apache Hadoop)

Collaborating within and across teams to manage data and models

10 Questions
No time limit
Practice Mode
Question 1 of 10
Your team is building an ML model to predict customer churn using structured data from multiple sources: transaction logs in BigQuery (10 TB), real-time customer events from Pub/Sub, and historical customer profiles in Cloud SQL (500 GB). Which combination of services would be most efficient for preprocessing this data before training?
Explanation
Option 3 is correct because it leverages the right tools for each task: Dataflow efficiently handles streaming data from Pub/Sub into BigQuery, Datastream provides continuous replication from Cloud SQL to BigQuery without manual intervention, and BigQuery excels at large-scale data preprocessing with its serverless architecture. This approach keeps all data in BigQuery where it can be efficiently preprocessed at scale. Option 1 doesn't address the real-time Pub/Sub data. Option 2 is unnecessarily complex and introduces latency by moving data out of BigQuery when it's already well-suited for the task. Option 4 involves too many manual steps and doesn't scale well for 10 TB of data.
Question 2 of 10
You need to create a centralized feature store for your organization that will serve features for both batch prediction jobs and low-latency online predictions (< 10ms). The features are computed from various sources including BigQuery tables and Dataflow pipelines. What is the best approach using Vertex AI Feature Store?
Explanation
Option 2 is correct because Vertex AI Feature Store is designed to handle both online and offline serving from a single source of truth. It automatically maintains synchronized online and offline stores, supports ingestion from multiple sources like BigQuery and Dataflow, and provides the low-latency access needed for online predictions while also supporting efficient batch retrieval. Option 1 creates unnecessary complexity and risks inconsistency. Option 3 cannot meet the <10ms latency requirement for online predictions as BigQuery is optimized for analytical queries, not low-latency point lookups. Option 4 creates data silos and duplicate feature engineering logic, violating the DRY principle and increasing maintenance burden.
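The "single source of truth" idea behind dual online/offline serving can be sketched in plain Python. This is a hand-rolled illustration of the pattern, not the Vertex AI Feature Store API; the class and method names are invented for the example:

```python
import time
from collections import defaultdict

class ToyFeatureStore:
    """Illustrative dual-store: one write path feeds both an online
    key-value view (latest values, fast point lookups) and an offline
    log (full history, for batch training)."""

    def __init__(self):
        self.online = {}                  # entity_id -> latest feature dict
        self.offline = defaultdict(list)  # entity_id -> [(ts, features), ...]

    def ingest(self, entity_id, features):
        ts = time.time()
        self.online[entity_id] = features               # online view
        self.offline[entity_id].append((ts, features))  # append-only history

    def read_online(self, entity_id):
        # Low-latency point lookup, analogous to online serving
        return self.online[entity_id]

    def read_offline(self, entity_id):
        # Full history, analogous to batch/offline retrieval
        return self.offline[entity_id]

store = ToyFeatureStore()
store.ingest("cust_1", {"days_since_last_txn": 3})
store.ingest("cust_1", {"days_since_last_txn": 0})
print(store.read_online("cust_1"))        # latest value only
print(len(store.read_offline("cust_1")))  # 2 historical rows
```

Because both views are fed by the same ingest path, online predictions and batch training read consistent feature definitions, which is exactly the inconsistency risk that maintaining two separate stores (Option 1) introduces.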
Question 3 of 10
Your healthcare ML project processes patient medical images (DICOM format) stored in Cloud Storage and associated metadata in BigQuery. The metadata contains PHI including patient names and medical record numbers. What is the most appropriate approach to handle this sensitive data before training?
Explanation
Option 1 is correct because Cloud DLP API is specifically designed to identify, classify, and de-identify sensitive data like PHI. It provides robust de-identification techniques (tokenization, masking, redaction) that comply with HIPAA regulations while maintaining the ability to re-identify data when necessary through secure tokens. Option 2 only addresses data-at-rest security but doesn't remove PHI from the actual training process, so the model could still learn from or memorize sensitive identifiers. Option 3 doesn't remove PHI, just restricts access, which doesn't address the privacy risk during training. Option 4 is insufficient because simple hashing can be reversed with rainbow tables or dictionary attacks, and doesn't address all types of PHI or provide the comprehensive de-identification needed for compliance.
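The weakness of simple hashing is easy to demonstrate. The sketch below hashes a hypothetical 6-digit medical record number (the format and value are invented for illustration) and then recovers it with a dictionary attack, because the input space is small enough to enumerate:

```python
import hashlib

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# A hashed medical record number from a "de-identified" dataset.
# (Hypothetical 6-digit MRN format, for illustration only.)
leaked_hash = sha256_hex("042517")

# Dictionary attack: the MRN space is tiny, so enumerating every
# candidate and hashing it recovers the original value quickly.
recovered = next(
    mrn for mrn in (f"{i:06d}" for i in range(1_000_000))
    if sha256_hex(mrn) == leaked_hash
)
print(recovered)  # 042517 -- the hash provided no real protection
```

DLP-style tokenization avoids this by using keyed, format-preserving transformations rather than a deterministic public hash of a small identifier space.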
Question 4 of 10
You are preprocessing a dataset of 50 million product images stored in Cloud Storage for a computer vision model. Each image needs to be resized, normalized, and augmented with rotations and color adjustments. Which approach provides the most scalable and cost-effective solution?
Explanation
Option 2 is correct because Dataflow provides serverless, auto-scaling parallel processing that is ideal for large-scale batch image preprocessing. Apache Beam's Python SDK allows custom transforms for image operations, and Dataflow automatically scales workers to the workload, making it cost-effective (you pay only for the resources used). It can read directly from Cloud Storage and write results back. Option 1 doesn't scale to 50 million images; a single large instance would have to run for an extended period, which is both slow and expensive. Option 3 is inappropriate because BigQuery is not designed for image processing and has size limits on individual rows. Option 4 requires manual cluster management and scaling configuration, making it more operationally complex than the fully managed Dataflow option.
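The per-image work each Dataflow worker would perform can be sketched with NumPy. A nearest-neighbour resize stands in for a real image library, and the fixed rotation stands in for random augmentation; the function name and sizes are illustrative:

```python
import numpy as np

def preprocess(image: np.ndarray, size=(224, 224)) -> np.ndarray:
    """Resize (nearest-neighbour), normalize to [0, 1], and apply a
    rotation augmentation -- the kind of per-element transform a
    Beam DoFn would run inside a Dataflow pipeline."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # nearest source row per output row
    cols = np.arange(size[1]) * w // size[1]   # nearest source column per output column
    resized = image[rows][:, cols]
    normalized = resized.astype(np.float32) / 255.0
    augmented = np.rot90(normalized)           # stand-in for random rotation
    return augmented

img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (224, 224, 3)
```

Because the transform is a pure per-image function, Dataflow can apply it across millions of files in parallel with no coordination between workers.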
Question 5 of 10
Your ML pipeline needs to process streaming text data from customer support chats in real-time, extract features, and make predictions. The chat logs contain PII such as email addresses, phone numbers, and credit card numbers. What is the best architecture to handle this?
Explanation
Option 1 is correct because it creates a complete real-time pipeline: Pub/Sub handles message ingestion, Dataflow processes streams at scale and can integrate with the DLP API to detect and remove PII in real-time, Vertex AI Feature Store provides low-latency feature serving, and Vertex AI Prediction delivers real-time inference. This architecture ensures PII is removed before any processing or storage. Option 2 doesn't meet the real-time requirement as it uses batch prediction. Option 3 introduces manual steps that can't scale for real-time streaming and creates bottlenecks. Option 4 is dangerous as it exposes PII to the prediction service before removal, violating data privacy principles and potentially logging sensitive information.
Question 6 of 10
You need to ingest and preprocess 100 TB of historical transaction logs stored in Apache Parquet format on-premises for a fraud detection model. The data needs to be joined with reference tables in Cloud Spanner. What is the most efficient approach?
Explanation
Option 2 is correct: Transfer Appliance is designed for securely shipping large datasets (40 TB and up) when network transfer would be too slow or expensive, BigQuery natively supports loading Parquet, Dataflow can efficiently export the Spanner reference tables to BigQuery, and BigQuery excels at large-scale joins. Option 1 is incorrect because BigQuery federated queries to Spanner have performance limitations and are not recommended for large-scale joins against 100 TB of data. Option 3 uses gsutil, which would be extremely slow and expensive for 100 TB over the network. Option 4 adds an unnecessary conversion step (Parquet to CSV), and Cloud SQL cannot efficiently handle 100 TB or complex joins with Spanner at this scale.
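A back-of-the-envelope calculation shows why network transfer is impractical at this scale. Assuming a sustained 1 Gbps uplink (optimistic for many on-premises environments, and ignoring protocol overhead and retries):

```python
DATA_TB = 100
BITS_PER_TB = 8 * 10**12   # decimal terabytes in bits
LINK_BPS = 1 * 10**9       # assumed 1 Gbps sustained throughput

seconds = DATA_TB * BITS_PER_TB / LINK_BPS
days = seconds / 86_400
print(f"{days:.1f} days")  # ~9.3 days at full, uninterrupted line rate
```

Over a week of saturating the link, even before failures and overhead, is exactly the scenario Transfer Appliance exists for.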
Question 7 of 10
Your team needs to perform exploratory data analysis on a 5 TB dataset in BigQuery and prototype ML models. Multiple data scientists need to collaborate simultaneously. What is the best approach for setting up the environment?
Explanation
Option 1 is correct because user-managed Workbench instances provide each data scientist with their own isolated, customizable environment with built-in BigQuery integration. They can query the same BigQuery dataset without moving data, install their own libraries, and save their work independently. Notebooks can be shared via Git or Cloud Storage for collaboration. Option 2 creates conflicts because multiple users can't effectively work simultaneously on the same notebook instance: they would overwrite each other's work and compete for resources. Option 3 is inefficient because it requires exporting 5 TB of data and doesn't provide the scalable compute needed for large datasets; local machines would struggle. Option 4 is too restrictive for exploratory analysis and prototyping, limiting the data scientists' ability to experiment with different approaches and visualizations.
Question 8 of 10
You are building a recommendation model that requires features from user clickstream data (Cloud Storage, 20 TB of JSON files), user profiles (Cloud SQL, 100 GB), and product catalog (Spanner, 500 GB). You need to create a repeatable preprocessing pipeline. What is the recommended approach?
Explanation
Option 1 is correct because TFX provides a production-ready, repeatable framework specifically designed for ML preprocessing pipelines. First consolidating data in BigQuery makes sense as it can efficiently handle the 20 TB of clickstream data and imported SQL/Spanner data. TFX components provide validation, transformation, and statistics generation with built-in versioning and reproducibility. The pipeline can be orchestrated with Vertex Pipelines. Option 2 lacks the ML-specific features of TFX like schema validation, anomaly detection, and feature transformation tracking. Option 3 doesn't provide the robustness, monitoring, or ML-specific capabilities needed for production pipelines and is hard to maintain. Option 4 is limited to BigQuery and doesn't provide the comprehensive ML pipeline features like data validation, feature engineering, and integration with model training that TFX offers.
Question 9 of 10
Your organization has tabular data in Cloud SQL (customer data), time-series data in Bigtable (IoT sensor readings), and text documents in Cloud Storage (customer feedback). You need to create a unified dataset in Vertex AI for training a multimodal model. What is the best strategy?
Explanation
Option 1 has problems because manual CSV exports don't scale well and Bigtable to CSV conversion loses the time-series structure and efficiency. Option 2 is correct because it creates a single source of truth in BigQuery, which can efficiently handle tabular, time-series (with appropriate schema design), and text data. Dataflow can efficiently export Bigtable and Cloud SQL data to BigQuery, and BigQuery can reference text files in Cloud Storage via external tables if needed. Vertex AI Managed Datasets integrate seamlessly with BigQuery. This approach also enables easier feature engineering and preprocessing using BigQuery's SQL and ML functions. Option 3 requires maintaining separate TFRecord generation pipelines and doesn't provide the query flexibility or ease of updates that BigQuery offers. Option 4 would have significant performance issues during training as federated queries add latency and don't provide the consistent performance needed for efficient model training.
Question 10 of 10
You need to track and version multiple preprocessing experiments on a BigQuery dataset where you're testing different feature engineering approaches (scaling methods, encoding strategies, feature selection). What is the most effective way to manage this experimentation in Vertex AI?
Explanation
Option 1 is correct because Vertex AI Experiments is specifically designed for ML experimentation tracking. It allows you to log preprocessing parameters (e.g., scaling method, encoding type), metrics (e.g., feature statistics, data quality scores), and artifacts (e.g., preprocessing configs, sample outputs). This creates a searchable, comparable record of all experiments. Storing preprocessed datasets in Cloud Storage with versioned paths ensures reproducibility. The Vertex AI SDK integrates with notebooks for easy tracking. Option 2 is error-prone, doesn't scale, and makes comparison difficult. Option 3 only versions the code, not the actual parameters used, results, or data lineage, making it hard to reproduce or compare experiments. Option 4 creates dataset variants but provides no systematic way to track what preprocessing was applied, what parameters were used, or how to compare results across experiments.
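The bookkeeping this gives you can be sketched locally. The dataclass below mimics what an experiment tracker records per run; it is a hand-rolled stand-in for the pattern, not the Vertex AI SDK, and the run IDs, parameters, and bucket paths are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class PreprocessingRun:
    run_id: str
    params: dict            # e.g. {"scaling": "standard", "encoding": "one-hot"}
    metrics: dict = field(default_factory=dict)
    artifact_uri: str = ""  # versioned output path for reproducibility

runs = [
    PreprocessingRun("run-001", {"scaling": "standard", "encoding": "one-hot"},
                     {"null_rate": 0.02}, "gs://example-bucket/preproc/run-001/"),
    PreprocessingRun("run-002", {"scaling": "min-max", "encoding": "target"},
                     {"null_rate": 0.02}, "gs://example-bucket/preproc/run-002/"),
]

# Searchable, comparable record: filter runs by any logged parameter.
standard_runs = [r.run_id for r in runs if r.params["scaling"] == "standard"]
print(standard_runs)  # ['run-001']
```

Because every run pairs its parameters with a versioned artifact path, any experiment can be reproduced or compared later, which is precisely what code-only versioning (Option 3) and ad hoc dataset copies (Option 4) fail to capture.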