
Exploring and preprocessing organization-wide data (e.g., Cloud Storage, BigQuery, Spanner, Cloud SQL, Apache Spark, Apache Hadoop)

Collaborating within and across teams to manage data and models

10 Questions
No time limit
Question 1 of 10
Your organization stores 500TB of time-series sensor data in Cloud Storage that needs to be preprocessed for training a predictive maintenance model. The preprocessing involves complex aggregations, window functions, and feature engineering. Which approach provides the most cost-effective and scalable solution?
Explanation
Loading data into BigQuery and using SQL with materialized views is the most cost-effective solution at this scale because: (1) BigQuery is optimized for large-scale analytics with petabyte-scale capabilities, (2) SQL window functions and aggregations are native and highly optimized, (3) materialized views cache results, reducing the cost of repeated computation, and (4) the serverless architecture means no infrastructure management. Dataflow (option B) would work but costs more for this use case, since the transformations can be done efficiently in SQL. Vertex AI Workbench (option C) doesn't scale to 500TB and would be extremely expensive. Cloud Functions (option D) is not designed for large-scale batch processing and would be inefficient at this volume.
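To make the window-function feature engineering concrete, here is a sketch of the kind of BigQuery SQL this describes, carried as a Python constant. The `sensors.readings` table and its columns are hypothetical stand-ins, not taken from the question; in practice the results of such a query would be persisted or cached for the training job.

```python
# Hypothetical BigQuery SQL for the feature engineering described:
# per-device rolling aggregates computed with window functions.
# Table and column names are illustrative.
FEATURE_SQL = """
SELECT
  device_id,
  ts,
  temp,
  AVG(temp) OVER (
    PARTITION BY device_id
    ORDER BY UNIX_SECONDS(ts)
    RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW
  ) AS temp_1h_avg,            -- rolling 1-hour mean per device
  temp - LAG(temp) OVER (
    PARTITION BY device_id ORDER BY ts
  ) AS temp_delta              -- change since the previous reading
FROM sensors.readings
"""
```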
Question 2 of 10
You need to create a centralized feature store for multiple ML teams working on customer recommendation models. Features are derived from real-time transaction data in Cloud Spanner and batch data in BigQuery. What is the best approach to implement this in Vertex AI Feature Store?
Explanation
Option B is correct because Vertex AI Feature Store is specifically designed for this use case: (1) A single feature store enables feature sharing across teams and ensures consistency, (2) Multiple entity types allow organizing features by different business entities (customers, products, etc.), (3) BigQuery integration supports batch ingestion through scheduled jobs, (4) Dataflow can stream features from Spanner for real-time updates, and (5) Feature Store provides versioning, point-in-time lookup, and low-latency serving. Option A creates silos and defeats the purpose of centralization. Option C requires custom development and maintenance without the benefits of managed service. Option D doesn't provide the optimized serving layer needed for low-latency predictions and lacks feature versioning.
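One of the guarantees mentioned above, point-in-time lookup, is worth making concrete: for a training label at time t, the store returns the latest feature value written at or before t, never a future value, which prevents label leakage. A minimal stdlib sketch of the idea (the data shapes here are illustrative, not the Vertex AI Feature Store API):

```python
from bisect import bisect_right

def point_in_time_value(history, t):
    """history: list of (timestamp, value) pairs sorted by timestamp.
    Returns the latest value at or before t, or None if none exists."""
    times = [ts for ts, _ in history]
    i = bisect_right(times, t)
    return history[i - 1][1] if i else None

# A customer's "spend" feature written at three different times.
spend_history = [(100, 10.0), (200, 12.5), (300, 9.0)]
```

A lookup at t=250 returns the value written at t=200, never the later write at t=300.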
Question 3 of 10
Your team is preprocessing medical imaging data containing PHI for a diagnostic ML model. The data must be de-identified before being used by the data science team. Which combination of Google Cloud services ensures proper anonymization while maintaining data utility?
Explanation
Option B is the correct comprehensive approach because: (1) the Cloud DLP API is specifically designed to detect and redact sensitive data, including PHI, (2) it handles metadata de-identification systematically, (3) custom transformations can remove embedded text from images (like patient names burned into scans), (4) storing results in a separate project provides organizational isolation, and (5) IAM controls ensure proper access governance. Option A is manual, error-prone, and doesn't scale. Option C (encryption) protects data in transit and at rest but doesn't de-identify it: authorized users still see PHI. Option D (shuffling) doesn't remove PHI; it just reorders data, which doesn't meet de-identification requirements under HIPAA or similar regulations.
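The approximate shape of a DLP de-identification request can be written as a plain Python dict, as it would be passed to the DLP API client. The infoType list below is illustrative (a real medical pipeline would typically add custom infoTypes for hospital-specific identifiers):

```python
# Approximate shape of a Cloud DLP de-identify request body.
# Each detected finding is replaced with its infoType label,
# e.g. "Jane Doe" -> "[PERSON_NAME]".
DEIDENTIFY_REQUEST = {
    "inspect_config": {
        "info_types": [
            {"name": "PERSON_NAME"},
            {"name": "DATE_OF_BIRTH"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
        ]
    },
    "deidentify_config": {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    },
}
```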
Question 4 of 10
You are building a text classification model and need to preprocess 10 million text documents stored across multiple Cloud Storage buckets. The preprocessing includes tokenization, lemmatization, and TF-IDF calculation. Which approach best leverages TensorFlow Extended (TFX) components?
Explanation
Option B represents the proper TFX pipeline approach: (1) ExampleGen ingests data from Cloud Storage into TFX format, (2) StatisticsGen computes statistics crucial for understanding data distribution and vocabulary, (3) SchemaGen creates a schema that defines expected data structure and types, (4) Transform component with tf.transform performs preprocessing that's consistent between training and serving, ensuring no training-serving skew. This creates a reproducible, production-ready pipeline. Option A skips critical validation and schema steps. Option C bypasses TFX entirely, losing pipeline orchestration, versioning, and training-serving consistency benefits. Option D is limited to BigQuery-resident data and doesn't provide the full ML pipeline infrastructure that TFX offers.
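The TF-IDF weighting the Transform step would compute can be sketched in plain Python; a real pipeline would express this with tf.transform analyzers so training and serving share the same logic. The smoothed IDF below is one common variant, not taken from the question:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists -> list of {token: tf-idf weight} per doc."""
    n = len(docs)
    # Document frequency: in how many docs does each token appear?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF keeps weights finite for tokens present in every doc.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    out = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        out.append({tok: (c / total) * idf[tok] for tok, c in counts.items()})
    return out

docs = [["a", "b", "a"], ["b", "c"]]
```

Token "a" appears twice in the first document and in only one document overall, so it outweighs "b", which appears everywhere.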
Question 5 of 10
Your organization has transactional data in Cloud SQL (PostgreSQL) that needs to be joined with historical data in BigQuery for feature engineering. The joined dataset will be used to train a model in Vertex AI. What is the most efficient approach?
Explanation
Option B is most efficient because: (1) BigQuery federated queries using EXTERNAL_QUERY can query Cloud SQL directly without data movement, (2) joins happen in BigQuery's optimized engine, (3) materialized results can be cached for repeated access, (4) BigQuery integrates directly with Vertex AI for training, eliminating additional export steps, and (5) no intermediate storage or data duplication is needed. Option A involves unnecessary data movement and multiple steps. Option C is overly complex, with replication overhead. Option D doesn't scale well: pandas cannot efficiently handle large joins and requires significant memory, and you lose BigQuery's optimization capabilities. Federated queries minimize data movement while leveraging BigQuery's processing power.
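A federated query of this kind has a distinctive shape: the Cloud SQL side runs inside an EXTERNAL_QUERY call and is joined in place against BigQuery tables. The connection ID, dataset, and columns below are hypothetical stand-ins:

```python
# Hypothetical federated query: join live Cloud SQL (PostgreSQL) rows
# with BigQuery history in a single statement, without exporting data.
FEDERATED_SQL = """
SELECT h.customer_id, h.lifetime_value, t.amount, t.created_at
FROM analytics.customer_history AS h
JOIN EXTERNAL_QUERY(
  'my-project.us.postgres-conn',  -- Cloud SQL connection resource
  'SELECT customer_id, amount, created_at FROM transactions'
) AS t
USING (customer_id)
"""
```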
Question 6 of 10
You need to ingest real-time clickstream data from Apache Kafka for immediate feature calculation and storage in Vertex AI Feature Store. The features must be available for online predictions within seconds. Which architecture provides the lowest latency?
Explanation
Option B provides the lowest latency because: (1) Dataflow supports streaming mode with sub-second processing latency, (2) direct integration with the Kafka source allows real-time consumption, (3) the Vertex AI Feature Store streaming ingestion API makes features available immediately, (4) there are no intermediate storage delays, and (5) Dataflow automatically scales to handle throughput variations. Option A has a 5-minute batching delay, which violates the seconds-level requirement. Option C adds unnecessary hops through Cloud Functions and BigQuery, increasing latency. Option D writes to Cloud Storage before the Feature Store, adding I/O delays, and typically uses micro-batches rather than true streaming. For real-time ML serving, streaming Dataflow into the Feature Store is the optimal pattern.
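The core per-key windowed aggregation such a streaming job would run can be sketched with the standard library alone; in the real pipeline this would be a Beam windowed combine running on Dataflow workers. The event shape and the 60-second window are illustrative:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative fixed-window size

def clicks_per_window(events):
    """events: iterable of (user_id, epoch_seconds) click events.
    Returns {(user_id, window_start): click count} per fixed window."""
    counts = defaultdict(int)
    for user, ts in events:
        window_start = ts - ts % WINDOW_SECONDS  # align to window boundary
        counts[(user, window_start)] += 1
    return dict(counts)

events = [("u1", 5), ("u1", 59), ("u1", 61), ("u2", 10)]
```

The two u1 clicks at t=5 and t=59 land in the [0, 60) window; the click at t=61 starts a new one.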
Question 7 of 10
Your team manages multiple datasets in Vertex AI for different model versions. You need to ensure reproducibility and track which dataset version was used for each model training run. What is the recommended approach?
Explanation
Option B is the recommended approach because: (1) Vertex AI Managed Datasets provide built-in versioning and lineage tracking, (2) Dataset resource names uniquely identify specific dataset versions, (3) Vertex AI Training automatically records dataset references in job metadata, creating an audit trail, (4) This integration enables querying which data was used for any model, (5) Supports ML Metadata for full lineage tracking from data to deployed model. Option A is manual, error-prone, and doesn't integrate with ML workflows. Option C lacks systematic versioning and requires manual correlation. Option D is a custom solution that doesn't leverage platform capabilities and adds operational overhead. Vertex AI's native dataset management provides governance and reproducibility out-of-the-box.
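The audit trail described amounts to keeping a mapping from each training run to the full dataset resource name, which Vertex AI maintains for you; a minimal stdlib sketch of that record, with placeholder project and dataset IDs in the documented resource-name format:

```python
# Stdlib sketch of the lineage record described: each training run
# stores the full dataset resource name, so any model can be traced
# back to the exact dataset version it was trained on.
_runs = {}

def record_training_run(run_id, dataset_resource_name):
    _runs[run_id] = {"dataset": dataset_resource_name}

def dataset_for_run(run_id):
    return _runs[run_id]["dataset"]

record_training_run(
    "churn-train-001",
    "projects/my-project/locations/us-central1/datasets/1234567890",
)
```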
Question 8 of 10
You are preprocessing image data stored in Cloud Storage for a computer vision model. The dataset contains 2 million images (5TB total). You need to perform augmentation (rotation, flipping, color adjustment) and resize operations. Which solution optimizes cost and performance?
Explanation
Option B is optimal because: (1) Dataflow automatically scales workers to process 2 million images in parallel, significantly reducing processing time, (2) Processes data where it lives (Cloud Storage) without egress costs, (3) TensorFlow transformations within Dataflow are efficient and can leverage GPUs if needed, (4) Writing to TFRecord format is optimal for TensorFlow/Keras training, (5) Managed service eliminates infrastructure concerns. Option A involves massive data egress costs and limited local processing capacity. Option C doesn't scale - even large instances struggle with 5TB, processing would be slow and expensive. Option D (Cloud Functions) has execution time limits, is designed for small tasks, and would be extremely slow and costly for 2 million images. Dataflow is purpose-built for large-scale data preprocessing.
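Two of the augmentations mentioned (horizontal and vertical flips) are simple enough to sketch with the standard library, treating an image as a list of pixel rows; inside the pipeline these would be TensorFlow ops (e.g. tf.image) running on Dataflow workers:

```python
def flip_horizontal(img):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in img]

def flip_vertical(img):
    """Reverse the order of the rows, top-to-bottom."""
    return list(reversed([list(row) for row in img]))

img = [[1, 2],
       [3, 4]]
```

Flipping twice along the same axis recovers the original image, a useful sanity check for any augmentation function.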
Question 9 of 10
Your organization uses Apache Hadoop clusters on-premises for data preprocessing. You need to migrate ML workloads to Google Cloud while minimizing code changes. The preprocessing involves complex MapReduce jobs and Hive queries. What migration strategy is most appropriate?
Explanation
Option B is the most appropriate migration strategy because: (1) Dataproc is API-compatible with Hadoop/Spark, allowing existing jobs to run with minimal changes, (2) Cloud Storage can replace HDFS via the Cloud Storage connector, (3) it provides a bridge strategy: run existing workloads while gradually refactoring to cloud-native services, (4) it offers autoscaling and ephemeral clusters for cost optimization, and (5) it reduces migration risk and timeline. Option A requires complete rewrites, with high effort and risk. Option C also requires significant rewriting; while BigQuery is powerful, an immediate full migration is risky for complex pipelines. Option D provides no cloud benefits (no autoscaling, full operational burden, no managed services), essentially running the on-premises architecture in the cloud. Lift-and-shift to Dataproc, then modernize, is the recommended pattern.
Question 10 of 10
You need to serve inference requests for a text model that requires loading large document embeddings stored in Cloud Spanner. Predictions must have less than 100ms latency. The model is deployed on Vertex AI Prediction. How should you optimize data access?
Explanation
Option C is correct for meeting the 100ms latency requirement because: (1) Memorystore for Redis provides sub-millisecond reads, fast enough for real-time serving, (2) Spanner queries typically take 5-50ms, which, combined with model inference, might exceed 100ms, (3) Redis acts as a cache layer while Spanner remains the consistent source of truth, (4) it supports the high QPS (queries per second) needed for prediction serving, and (5) it can implement a cache-aside pattern with TTLs for freshness. Option A (direct Spanner reads) might not consistently meet the 100ms SLA for complex queries. Option B isn't feasible if the embeddings are too large for memory or change frequently. Option D (BigQuery) is designed for analytics, not low-latency serving; queries typically take seconds. For latency-sensitive ML serving, a caching layer like Memorystore is essential.