
Exploring and preprocessing organization-wide data (e.g., Cloud Storage, BigQuery, Spanner, Cloud SQL, Apache Spark, Apache Hadoop)

Collaborating within and across teams to manage data and models

10 Questions
No time limit
Practice Mode
Question 1 of 10
Your team is building an ML model to predict customer churn using structured data from multiple sources: transaction logs in BigQuery (10 TB), real-time customer events from Pub/Sub, and historical customer profiles in Cloud SQL (500 GB). Which combination of services would be most efficient for preprocessing this data before training?
Explanation
Option 3 is correct because it leverages the right tools for each task: Dataflow efficiently handles streaming data from Pub/Sub into BigQuery, Datastream provides continuous replication from Cloud SQL to BigQuery without manual intervention, and BigQuery excels at large-scale data preprocessing with its serverless architecture. This approach keeps all data in BigQuery where it can be efficiently preprocessed at scale. Option 1 doesn't address the real-time Pub/Sub data. Option 2 is unnecessarily complex and introduces latency by moving data out of BigQuery when it's already well-suited for the task. Option 4 involves too many manual steps and doesn't scale well for 10 TB of data.
Question 2 of 10
You need to create a centralized feature store for your organization that will serve features for both batch prediction jobs and low-latency online predictions (< 10ms). The features are computed from various sources including BigQuery tables and Dataflow pipelines. What is the best approach using Vertex AI Feature Store?
Explanation
Option 2 is correct because Vertex AI Feature Store is designed to handle both online and offline serving from a single source of truth. It automatically maintains synchronized online and offline stores, supports ingestion from multiple sources like BigQuery and Dataflow, and provides the low-latency access needed for online predictions while also supporting efficient batch retrieval. Option 1 creates unnecessary complexity and risks inconsistency. Option 3 cannot meet the <10ms latency requirement for online predictions as BigQuery is optimized for analytical queries, not low-latency point lookups. Option 4 creates data silos and duplicate feature engineering logic, violating the DRY principle and increasing maintenance burden.
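The "single source of truth" idea behind dual online/offline serving can be sketched in plain Python. This is a hand-rolled illustration of the pattern, not the Vertex AI Feature Store API; the class and method names are invented for the example:

```python
import time
from collections import defaultdict

class ToyFeatureStore:
    """Illustrative dual-store: one write path feeds both an online
    key-value view (latest values, fast point lookups) and an offline
    log (full history, for batch training)."""

    def __init__(self):
        self.online = {}                  # entity_id -> latest feature dict
        self.offline = defaultdict(list)  # entity_id -> [(ts, features), ...]

    def ingest(self, entity_id, features):
        ts = time.time()
        self.online[entity_id] = features               # online view
        self.offline[entity_id].append((ts, features))  # append-only history

    def read_online(self, entity_id):
        # Low-latency point lookup, analogous to online serving
        return self.online[entity_id]

    def read_offline(self, entity_id):
        # Full history, analogous to batch/offline retrieval
        return self.offline[entity_id]

store = ToyFeatureStore()
store.ingest("cust_1", {"days_since_last_txn": 3})
store.ingest("cust_1", {"days_since_last_txn": 0})
print(store.read_online("cust_1"))        # latest value only
print(len(store.read_offline("cust_1")))  # 2 historical rows
```

Because both views are fed by the same ingest path, online predictions and batch training read consistent feature definitions, which is exactly the inconsistency risk that maintaining two separate stores (Option 1) introduces.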
Question 3 of 10
Your healthcare ML project processes patient medical images (DICOM format) stored in Cloud Storage and associated metadata in BigQuery. The metadata contains PHI including patient names and medical record numbers. What is the most appropriate approach to handle this sensitive data before training?
Explanation
Option 1 is correct because Cloud DLP API is specifically designed to identify, classify, and de-identify sensitive data like PHI. It provides robust de-identification techniques (tokenization, masking, redaction) that comply with HIPAA regulations while maintaining the ability to re-identify data when necessary through secure tokens. Option 2 only addresses data-at-rest security but doesn't remove PHI from the actual training process, so the model could still learn from or memorize sensitive identifiers. Option 3 doesn't remove PHI, just restricts access, which doesn't address the privacy risk during training. Option 4 is insufficient because simple hashing can be reversed with rainbow tables or dictionary attacks, and doesn't address all types of PHI or provide the comprehensive de-identification needed for compliance.
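The weakness of simple hashing is easy to demonstrate. The sketch below hashes a hypothetical 6-digit medical record number (the format and value are invented for illustration) and then recovers it with a dictionary attack, because the input space is small enough to enumerate:

```python
import hashlib

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# A hashed medical record number from a "de-identified" dataset.
# (Hypothetical 6-digit MRN format, for illustration only.)
leaked_hash = sha256_hex("042517")

# Dictionary attack: the MRN space is tiny, so enumerating every
# candidate and hashing it recovers the original value quickly.
recovered = next(
    mrn for mrn in (f"{i:06d}" for i in range(1_000_000))
    if sha256_hex(mrn) == leaked_hash
)
print(recovered)  # 042517 -- the hash provided no real protection
```

DLP-style tokenization avoids this by using keyed, format-preserving transformations rather than a deterministic public hash of a small identifier space.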
Question 4 of 10
You are preprocessing a dataset of 50 million product images stored in Cloud Storage for a computer vision model. Each image needs to be resized, normalized, and augmented with rotations and color adjustments. Which approach provides the most scalable and cost-effective solution?
Explanation
Option 2 is correct because Dataflow provides serverless, auto-scaling parallel processing that is ideal for large-scale batch image preprocessing. Apache Beam's Python SDK allows custom transforms for image operations, and Dataflow automatically scales workers to the workload, making it cost-effective (you pay only for the resources used). It can read directly from Cloud Storage and write results back. Option 1 doesn't scale to 50 million images; a single large instance would have to run for an extended period, which is both slow and expensive. Option 3 is inappropriate because BigQuery is not designed for image processing and has size limits on individual rows. Option 4 requires manual cluster management and scaling configuration, making it more operationally complex than the fully managed Dataflow option.
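The per-image work each Dataflow worker would perform can be sketched with NumPy. A nearest-neighbour resize stands in for a real image library, and the fixed rotation stands in for random augmentation; the function name and sizes are illustrative:

```python
import numpy as np

def preprocess(image: np.ndarray, size=(224, 224)) -> np.ndarray:
    """Resize (nearest-neighbour), normalize to [0, 1], and apply a
    rotation augmentation -- the kind of per-element transform a
    Beam DoFn would run inside a Dataflow pipeline."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # nearest source row per output row
    cols = np.arange(size[1]) * w // size[1]   # nearest source column per output column
    resized = image[rows][:, cols]
    normalized = resized.astype(np.float32) / 255.0
    augmented = np.rot90(normalized)           # stand-in for random rotation
    return augmented

img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (224, 224, 3)
```

Because the transform is a pure per-image function, Dataflow can apply it across millions of files in parallel with no coordination between workers.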
Question 5 of 10
Your ML pipeline needs to process streaming text data from customer support chats in real-time, extract features, and make predictions. The chat logs contain PII such as email addresses, phone numbers, and credit card numbers. What is the best architecture to handle this?
Explanation
Option 1 is correct because it creates a complete real-time pipeline: Pub/Sub handles message ingestion, Dataflow processes streams at scale and can integrate with the DLP API to detect and remove PII in real-time, Vertex AI Feature Store provides low-latency feature serving, and Vertex AI Prediction delivers real-time inference. This architecture ensures PII is removed before any processing or storage. Option 2 doesn't meet the real-time requirement as it uses batch prediction. Option 3 introduces manual steps that can't scale for real-time streaming and creates bottlenecks. Option 4 is dangerous as it exposes PII to the prediction service before removal, violating data privacy principles and potentially logging sensitive information.
Question 6 of 10
You need to ingest and preprocess 100 TB of historical transaction logs stored in Apache Parquet format on-premises for a fraud detection model. The data needs to be joined with reference tables in Cloud Spanner. What is the most efficient approach?
Explanation
Option 2 is correct: Transfer Appliance is designed for securely shipping large datasets (40 TB and up) when network transfer would be too slow or expensive, BigQuery natively supports loading Parquet, Dataflow can efficiently export the Spanner reference tables to BigQuery, and BigQuery excels at large-scale joins. Option 1 is incorrect because BigQuery federated queries to Spanner have performance limitations and are not recommended for large-scale joins against 100 TB of data. Option 3 uses gsutil, which would be extremely slow and expensive for 100 TB over the network. Option 4 adds an unnecessary conversion step (Parquet to CSV), and Cloud SQL cannot efficiently handle 100 TB or complex joins with Spanner at this scale.
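A back-of-the-envelope calculation shows why network transfer is impractical at this scale. Assuming a sustained 1 Gbps uplink (optimistic for many on-premises environments, and ignoring protocol overhead and retries):

```python
DATA_TB = 100
BITS_PER_TB = 8 * 10**12   # decimal terabytes in bits
LINK_BPS = 1 * 10**9       # assumed 1 Gbps sustained throughput

seconds = DATA_TB * BITS_PER_TB / LINK_BPS
days = seconds / 86_400
print(f"{days:.1f} days")  # ~9.3 days at full, uninterrupted line rate
```

Over a week of saturating the link, even before failures and overhead, is exactly the scenario Transfer Appliance exists for.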
Question 7 of 10
Your team needs to perform exploratory data analysis on a 5 TB dataset in BigQuery and prototype ML models. Multiple data scientists need to collaborate simultaneously. What is the best approach for setting up the environment?
Explanation
Option 1 is correct because user-managed Workbench instances provide each data scientist with their own isolated, customizable environment with built-in BigQuery integration. They can query the same BigQuery dataset without moving data, install their own libraries, and save their work independently. Notebooks can be shared via Git or Cloud Storage for collaboration. Option 2 creates conflicts because multiple users can't effectively work simultaneously on the same notebook instance: they would overwrite each other's work and compete for resources. Option 3 is inefficient because it requires exporting 5 TB of data and doesn't provide the scalable compute needed for large datasets; local machines would struggle. Option 4 is too restrictive for exploratory analysis and prototyping, limiting the data scientists' ability to experiment with different approaches and visualizations.
Question 8 of 10
You are building a recommendation model that requires features from user clickstream data (Cloud Storage, 20 TB of JSON files), user profiles (Cloud SQL, 100 GB), and product catalog (Spanner, 500 GB). You need to create a repeatable preprocessing pipeline. What is the recommended approach?
Explanation
Option 1 is correct because TFX provides a production-ready, repeatable framework specifically designed for ML preprocessing pipelines. First consolidating data in BigQuery makes sense as it can efficiently handle the 20 TB of clickstream data and imported SQL/Spanner data. TFX components provide validation, transformation, and statistics generation with built-in versioning and reproducibility. The pipeline can be orchestrated with Vertex Pipelines. Option 2 lacks the ML-specific features of TFX like schema validation, anomaly detection, and feature transformation tracking. Option 3 doesn't provide the robustness, monitoring, or ML-specific capabilities needed for production pipelines and is hard to maintain. Option 4 is limited to BigQuery and doesn't provide the comprehensive ML pipeline features like data validation, feature engineering, and integration with model training that TFX offers.
Question 9 of 10
Your organization has tabular data in Cloud SQL (customer data), time-series data in Bigtable (IoT sensor readings), and text documents in Cloud Storage (customer feedback). You need to create a unified dataset in Vertex AI for training a multimodal model. What is the best strategy?
Explanation
Option 1 has problems because manual CSV exports don't scale well and Bigtable to CSV conversion loses the time-series structure and efficiency. Option 2 is correct because it creates a single source of truth in BigQuery, which can efficiently handle tabular, time-series (with appropriate schema design), and text data. Dataflow can efficiently export Bigtable and Cloud SQL data to BigQuery, and BigQuery can reference text files in Cloud Storage via external tables if needed. Vertex AI Managed Datasets integrate seamlessly with BigQuery. This approach also enables easier feature engineering and preprocessing using BigQuery's SQL and ML functions. Option 3 requires maintaining separate TFRecord generation pipelines and doesn't provide the query flexibility or ease of updates that BigQuery offers. Option 4 would have significant performance issues during training as federated queries add latency and don't provide the consistent performance needed for efficient model training.
Question 10 of 10
You need to track and version multiple preprocessing experiments on a BigQuery dataset where you're testing different feature engineering approaches (scaling methods, encoding strategies, feature selection). What is the most effective way to manage this experimentation in Vertex AI?
Explanation
Option 1 is correct because Vertex AI Experiments is specifically designed for ML experimentation tracking. It allows you to log preprocessing parameters (e.g., scaling method, encoding type), metrics (e.g., feature statistics, data quality scores), and artifacts (e.g., preprocessing configs, sample outputs). This creates a searchable, comparable record of all experiments. Storing preprocessed datasets in Cloud Storage with versioned paths ensures reproducibility. The Vertex AI SDK integrates with notebooks for easy tracking. Option 2 is error-prone, doesn't scale, and makes comparison difficult. Option 3 only versions the code, not the actual parameters used, results, or data lineage, making it hard to reproduce or compare experiments. Option 4 creates dataset variants but provides no systematic way to track what preprocessing was applied, what parameters were used, or how to compare results across experiments.
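The bookkeeping this gives you can be sketched locally. The dataclass below mimics what an experiment tracker records per run; it is a hand-rolled stand-in for the pattern, not the Vertex AI SDK, and the run IDs, parameters, and bucket paths are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class PreprocessingRun:
    run_id: str
    params: dict            # e.g. {"scaling": "standard", "encoding": "one-hot"}
    metrics: dict = field(default_factory=dict)
    artifact_uri: str = ""  # versioned output path for reproducibility

runs = [
    PreprocessingRun("run-001", {"scaling": "standard", "encoding": "one-hot"},
                     {"null_rate": 0.02}, "gs://example-bucket/preproc/run-001/"),
    PreprocessingRun("run-002", {"scaling": "min-max", "encoding": "target"},
                     {"null_rate": 0.02}, "gs://example-bucket/preproc/run-002/"),
]

# Searchable, comparable record: filter runs by any logged parameter.
standard_runs = [r.run_id for r in runs if r.params["scaling"] == "standard"]
print(standard_runs)  # ['run-001']
```

Because every run pairs its parameters with a versioned artifact path, any experiment can be reproduced or compared later, which is precisely what code-only versioning (Option 3) and ad hoc dataset copies (Option 4) fail to capture.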