
Exploring and preprocessing organization-wide data (e.g., Cloud Storage, BigQuery, Spanner, Cloud SQL, Apache Spark, Apache Hadoop)

Collaborating within and across teams to manage data and models

10 Questions
No time limit
Question 1 of 10
Your organization stores 500TB of time-series sensor data in Cloud Storage that needs to be preprocessed for training a predictive maintenance model. The preprocessing involves complex aggregations, window functions, and feature engineering. Which approach provides the most cost-effective and scalable solution?
Explanation
Loading data into BigQuery and using SQL with materialized views is the most cost-effective solution at this scale because: (1) BigQuery is optimized for large-scale analytics with petabyte-scale capabilities, (2) SQL window functions and aggregations are native and highly optimized, (3) materialized views cache results, reducing the cost of repeated computation, and (4) the serverless architecture means no infrastructure management. Dataflow (option B) would work but costs more for this use case, since the transformations can be done efficiently in SQL. Vertex AI Workbench (option C) doesn't scale to 500TB and would be extremely expensive. Cloud Functions (option D) is not designed for large-scale batch processing and would be inefficient at this volume.
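To make the window-function feature engineering concrete, here is a sketch of the kind of BigQuery SQL this describes, carried as a Python constant. The `sensors.readings` table and its columns are hypothetical stand-ins, not taken from the question; in practice the results of such a query would be persisted or cached for the training job.

```python
# Hypothetical BigQuery SQL for the feature engineering described:
# per-device rolling aggregates computed with window functions.
# Table and column names are illustrative.
FEATURE_SQL = """
SELECT
  device_id,
  ts,
  temp,
  AVG(temp) OVER (
    PARTITION BY device_id
    ORDER BY UNIX_SECONDS(ts)
    RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW
  ) AS temp_1h_avg,            -- rolling 1-hour mean per device
  temp - LAG(temp) OVER (
    PARTITION BY device_id ORDER BY ts
  ) AS temp_delta              -- change since the previous reading
FROM sensors.readings
"""
```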
Question 2 of 10
You need to create a centralized feature store for multiple ML teams working on customer recommendation models. Features are derived from real-time transaction data in Cloud Spanner and batch data in BigQuery. What is the best approach to implement this in Vertex AI Feature Store?
Explanation
Option B is correct because Vertex AI Feature Store is specifically designed for this use case: (1) A single feature store enables feature sharing across teams and ensures consistency, (2) Multiple entity types allow organizing features by different business entities (customers, products, etc.), (3) BigQuery integration supports batch ingestion through scheduled jobs, (4) Dataflow can stream features from Spanner for real-time updates, and (5) Feature Store provides versioning, point-in-time lookup, and low-latency serving. Option A creates silos and defeats the purpose of centralization. Option C requires custom development and maintenance without the benefits of managed service. Option D doesn't provide the optimized serving layer needed for low-latency predictions and lacks feature versioning.
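One of the guarantees mentioned above, point-in-time lookup, is worth making concrete: for a training label at time t, the store returns the latest feature value written at or before t, never a future value, which prevents label leakage. A minimal stdlib sketch of the idea (the data shapes here are illustrative, not the Vertex AI Feature Store API):

```python
from bisect import bisect_right

def point_in_time_value(history, t):
    """history: list of (timestamp, value) pairs sorted by timestamp.
    Returns the latest value at or before t, or None if none exists."""
    times = [ts for ts, _ in history]
    i = bisect_right(times, t)
    return history[i - 1][1] if i else None

# A customer's "spend" feature written at three different times.
spend_history = [(100, 10.0), (200, 12.5), (300, 9.0)]
```

A lookup at t=250 returns the value written at t=200, never the later write at t=300.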
Question 3 of 10
Your team is preprocessing medical imaging data containing PHI for a diagnostic ML model. The data must be de-identified before being used by the data science team. Which combination of Google Cloud services ensures proper anonymization while maintaining data utility?
Explanation
Option B is the correct comprehensive approach because: (1) the Cloud DLP API is specifically designed to detect and redact sensitive data, including PHI, (2) it handles metadata de-identification systematically, (3) custom transformations can remove embedded text from images (like patient names burned into scans), (4) storing results in a separate project provides organizational isolation, and (5) IAM controls ensure proper access governance. Option A is manual, error-prone, and doesn't scale. Option C (encryption) protects data in transit and at rest but doesn't de-identify it: authorized users still see PHI. Option D (shuffling) doesn't remove PHI; it just reorders data, which doesn't meet de-identification requirements under HIPAA or similar regulations.
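The approximate shape of a DLP de-identification request can be written as a plain Python dict, as it would be passed to the DLP API client. The infoType list below is illustrative (a real medical pipeline would typically add custom infoTypes for hospital-specific identifiers):

```python
# Approximate shape of a Cloud DLP de-identify request body.
# Each detected finding is replaced with its infoType label,
# e.g. "Jane Doe" -> "[PERSON_NAME]".
DEIDENTIFY_REQUEST = {
    "inspect_config": {
        "info_types": [
            {"name": "PERSON_NAME"},
            {"name": "DATE_OF_BIRTH"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
        ]
    },
    "deidentify_config": {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    },
}
```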
Question 4 of 10
You are building a text classification model and need to preprocess 10 million text documents stored across multiple Cloud Storage buckets. The preprocessing includes tokenization, lemmatization, and TF-IDF calculation. Which approach best leverages TensorFlow Extended (TFX) components?
Explanation
Option B represents the proper TFX pipeline approach: (1) ExampleGen ingests data from Cloud Storage into TFX format, (2) StatisticsGen computes statistics crucial for understanding data distribution and vocabulary, (3) SchemaGen creates a schema that defines expected data structure and types, (4) Transform component with tf.transform performs preprocessing that's consistent between training and serving, ensuring no training-serving skew. This creates a reproducible, production-ready pipeline. Option A skips critical validation and schema steps. Option C bypasses TFX entirely, losing pipeline orchestration, versioning, and training-serving consistency benefits. Option D is limited to BigQuery-resident data and doesn't provide the full ML pipeline infrastructure that TFX offers.
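The TF-IDF weighting the Transform step would compute can be sketched in plain Python; a real pipeline would express this with tf.transform analyzers so training and serving share the same logic. The smoothed IDF below is one common variant, not taken from the question:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists -> list of {token: tf-idf weight} per doc."""
    n = len(docs)
    # Document frequency: in how many docs does each token appear?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF keeps weights finite for tokens present in every doc.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    out = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        out.append({tok: (c / total) * idf[tok] for tok, c in counts.items()})
    return out

docs = [["a", "b", "a"], ["b", "c"]]
```

Token "a" appears twice in the first document and in only one document overall, so it outweighs "b", which appears everywhere.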
Question 5 of 10
Your organization has transactional data in Cloud SQL (PostgreSQL) that needs to be joined with historical data in BigQuery for feature engineering. The joined dataset will be used to train a model in Vertex AI. What is the most efficient approach?
Explanation
Option B is most efficient because: (1) BigQuery federated queries using EXTERNAL_QUERY can query Cloud SQL directly without data movement, (2) joins happen in BigQuery's optimized engine, (3) materialized results can be cached for repeated access, (4) BigQuery integrates directly with Vertex AI for training, eliminating additional export steps, and (5) no intermediate storage or data duplication is needed. Option A involves unnecessary data movement and multiple steps. Option C is overly complex, with replication overhead. Option D doesn't scale well: pandas cannot efficiently handle large joins and requires significant memory, and you lose BigQuery's optimization capabilities. Federated queries minimize data movement while leveraging BigQuery's processing power.
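A federated query of this kind has a distinctive shape: the Cloud SQL side runs inside an EXTERNAL_QUERY call and is joined in place against BigQuery tables. The connection ID, dataset, and columns below are hypothetical stand-ins:

```python
# Hypothetical federated query: join live Cloud SQL (PostgreSQL) rows
# with BigQuery history in a single statement, without exporting data.
FEDERATED_SQL = """
SELECT h.customer_id, h.lifetime_value, t.amount, t.created_at
FROM analytics.customer_history AS h
JOIN EXTERNAL_QUERY(
  'my-project.us.postgres-conn',  -- Cloud SQL connection resource
  'SELECT customer_id, amount, created_at FROM transactions'
) AS t
USING (customer_id)
"""
```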
Question 6 of 10
You need to ingest real-time clickstream data from Apache Kafka for immediate feature calculation and storage in Vertex AI Feature Store. The features must be available for online predictions within seconds. Which architecture provides the lowest latency?
Explanation
Option B provides the lowest latency because: (1) Dataflow supports streaming mode with sub-second processing latency, (2) direct integration with the Kafka source allows real-time consumption, (3) the Vertex AI Feature Store streaming ingestion API makes features available immediately, (4) there are no intermediate storage delays, and (5) Dataflow automatically scales to handle throughput variations. Option A has a 5-minute batching delay, which violates the seconds-level requirement. Option C adds unnecessary hops through Cloud Functions and BigQuery, increasing latency. Option D writes to Cloud Storage before the Feature Store, adding I/O delays, and typically uses micro-batches rather than true streaming. For real-time ML serving, streaming Dataflow into the Feature Store is the optimal pattern.
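The core per-key windowed aggregation such a streaming job would run can be sketched with the standard library alone; in the real pipeline this would be a Beam windowed combine running on Dataflow workers. The event shape and the 60-second window are illustrative:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative fixed-window size

def clicks_per_window(events):
    """events: iterable of (user_id, epoch_seconds) click events.
    Returns {(user_id, window_start): click count} per fixed window."""
    counts = defaultdict(int)
    for user, ts in events:
        window_start = ts - ts % WINDOW_SECONDS  # align to window boundary
        counts[(user, window_start)] += 1
    return dict(counts)

events = [("u1", 5), ("u1", 59), ("u1", 61), ("u2", 10)]
```

The two u1 clicks at t=5 and t=59 land in the [0, 60) window; the click at t=61 starts a new one.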
Question 7 of 10
Your team manages multiple datasets in Vertex AI for different model versions. You need to ensure reproducibility and track which dataset version was used for each model training run. What is the recommended approach?
Explanation
Option B is the recommended approach because: (1) Vertex AI Managed Datasets provide built-in versioning and lineage tracking, (2) Dataset resource names uniquely identify specific dataset versions, (3) Vertex AI Training automatically records dataset references in job metadata, creating an audit trail, (4) This integration enables querying which data was used for any model, (5) Supports ML Metadata for full lineage tracking from data to deployed model. Option A is manual, error-prone, and doesn't integrate with ML workflows. Option C lacks systematic versioning and requires manual correlation. Option D is a custom solution that doesn't leverage platform capabilities and adds operational overhead. Vertex AI's native dataset management provides governance and reproducibility out-of-the-box.
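The audit trail described amounts to keeping a mapping from each training run to the full dataset resource name, which Vertex AI maintains for you; a minimal stdlib sketch of that record, with placeholder project and dataset IDs in the documented resource-name format:

```python
# Stdlib sketch of the lineage record described: each training run
# stores the full dataset resource name, so any model can be traced
# back to the exact dataset version it was trained on.
_runs = {}

def record_training_run(run_id, dataset_resource_name):
    _runs[run_id] = {"dataset": dataset_resource_name}

def dataset_for_run(run_id):
    return _runs[run_id]["dataset"]

record_training_run(
    "churn-train-001",
    "projects/my-project/locations/us-central1/datasets/1234567890",
)
```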
Question 8 of 10
You are preprocessing image data stored in Cloud Storage for a computer vision model. The dataset contains 2 million images (5TB total). You need to perform augmentation (rotation, flipping, color adjustment) and resize operations. Which solution optimizes cost and performance?
Explanation
Option B is optimal because: (1) Dataflow automatically scales workers to process 2 million images in parallel, significantly reducing processing time, (2) Processes data where it lives (Cloud Storage) without egress costs, (3) TensorFlow transformations within Dataflow are efficient and can leverage GPUs if needed, (4) Writing to TFRecord format is optimal for TensorFlow/Keras training, (5) Managed service eliminates infrastructure concerns. Option A involves massive data egress costs and limited local processing capacity. Option C doesn't scale - even large instances struggle with 5TB, processing would be slow and expensive. Option D (Cloud Functions) has execution time limits, is designed for small tasks, and would be extremely slow and costly for 2 million images. Dataflow is purpose-built for large-scale data preprocessing.
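Two of the augmentations mentioned (horizontal and vertical flips) are simple enough to sketch with the standard library, treating an image as a list of pixel rows; inside the pipeline these would be TensorFlow ops (e.g. tf.image) running on Dataflow workers:

```python
def flip_horizontal(img):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in img]

def flip_vertical(img):
    """Reverse the order of the rows, top-to-bottom."""
    return list(reversed([list(row) for row in img]))

img = [[1, 2],
       [3, 4]]
```

Flipping twice along the same axis recovers the original image, a useful sanity check for any augmentation function.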
Question 9 of 10
Your organization uses Apache Hadoop clusters on-premises for data preprocessing. You need to migrate ML workloads to Google Cloud while minimizing code changes. The preprocessing involves complex MapReduce jobs and Hive queries. What migration strategy is most appropriate?
Explanation
Option B is the most appropriate migration strategy because: (1) Dataproc is API-compatible with Hadoop/Spark, allowing existing jobs to run with minimal changes, (2) Cloud Storage can replace HDFS via the Cloud Storage connector, (3) it provides a bridge strategy: run existing workloads while gradually refactoring to cloud-native services, (4) it offers autoscaling and ephemeral clusters for cost optimization, and (5) it reduces migration risk and timeline. Option A requires complete rewrites, with high effort and risk. Option C also requires significant rewriting; while BigQuery is powerful, an immediate full migration is risky for complex pipelines. Option D provides no cloud benefits (no autoscaling, full operational burden, no managed services), essentially running the on-premises architecture in the cloud. Lift-and-shift to Dataproc, then modernize, is the recommended pattern.
Question 10 of 10
You need to serve inference requests for a text model that requires loading large document embeddings stored in Cloud Spanner. Predictions must have less than 100ms latency. The model is deployed on Vertex AI Prediction. How should you optimize data access?
Explanation
Option C is correct for meeting the 100ms latency requirement because: (1) Memorystore for Redis provides sub-millisecond reads, fast enough for real-time serving, (2) Spanner queries typically take 5-50ms, which, combined with model inference, might exceed 100ms, (3) Redis acts as a cache layer while Spanner remains the consistent source of truth, (4) it supports the high QPS (queries per second) needed for prediction serving, and (5) it can implement a cache-aside pattern with TTLs for freshness. Option A (direct Spanner reads) might not consistently meet the 100ms SLA for complex queries. Option B isn't feasible if the embeddings are too large for memory or change frequently. Option D (BigQuery) is designed for analytics, not low-latency serving; queries typically take seconds. For latency-sensitive ML serving, a caching layer like Memorystore is essential.