Collaborating within and across teams to manage data and models - Quiz

Question 1 of 10

Your organization stores 500 TB of structured transaction data in BigQuery that needs to be preprocessed for a fraud detection model. The preprocessing involves complex windowing operations calculating rolling statistics over 90-day periods. Which approach would be MOST cost-effective and performant?

A. Export data to Cloud Storage, use Dataflow with Apache Beam windowing operations, then load back to BigQuery
B. Use BigQuery SQL with window functions (OVER clause) to calculate rolling statistics directly in BigQuery
C. Export to Cloud Storage and use Dataproc with Apache Spark for windowing calculations
D. Use Vertex AI Workbench to load data in chunks and process with pandas rolling windows

Explanation

BigQuery SQL with window functions (Option B) is the most cost-effective solution because: 1) Data stays in BigQuery, eliminating data movement, 2) BigQuery's columnar storage and distributed architecture are optimized for analytical queries with window functions, 3) No additional compute resources need to be provisioned. Option A adds an unnecessary export and reload round trip plus Dataflow compute costs. Option C requires a Dataproc cluster, which is more expensive for this use case. Option D doesn't scale to 500 TB and would require complex chunking logic with potential memory issues.
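As a rough sketch of the window-function approach, the snippet below keeps an illustrative BigQuery query in a string and mirrors its logic in plain Python so the semantics can be checked locally. The table name `project.dataset.transactions` and the columns `account_id`, `txn_date`, and `amount` are assumptions, not details from the question. The Python mirror also assumes at most one transaction per account per day.

```python
from collections import deque
from datetime import date

# Illustrative BigQuery SQL (hypothetical table and column names).
# A RANGE frame needs a numeric ORDER BY key, so the date goes through UNIX_DATE.
ROLLING_SQL = """
SELECT
  account_id,
  txn_date,
  AVG(amount) OVER (
    PARTITION BY account_id
    ORDER BY UNIX_DATE(txn_date)
    RANGE BETWEEN 89 PRECEDING AND CURRENT ROW
  ) AS avg_amount_90d
FROM `project.dataset.transactions`
"""


def rolling_mean_90d(rows, days=90):
    """Local mirror of the query: trailing mean per account over `days` days.

    rows: iterable of (account_id, txn_date, amount) tuples.
    Returns (account_id, txn_date, rolling_mean) tuples sorted by account, then date.
    Assumes one row per account per day; with same-day ties the SQL RANGE frame
    would also include the peer rows.
    """
    out = []
    windows = {}  # account_id -> (deque of (day_ordinal, amount), running_sum)
    for account, day, amount in sorted(rows):
        window, total = windows.get(account, (deque(), 0.0))
        window.append((day.toordinal(), amount))
        total += amount
        # Evict rows older than the 90-day frame (current day counts as day 1).
        while window[0][0] < day.toordinal() - (days - 1):
            total -= window.popleft()[1]
        windows[account] = (window, total)
        out.append((account, day, total / len(window)))
    return out
```

The local function is only there to make the frame boundaries concrete; at 500 TB the query itself is what would run.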

Question 2 of 10

You need to create a unified feature store for multiple ML teams working on customer segmentation models. Source data includes real-time clickstream events (Cloud Pub/Sub), historical purchase data (BigQuery), and customer profiles (Cloud SQL). What is the BEST approach to consolidate these features in Vertex AI Feature Store?

A. Use Dataflow to stream data from all sources directly to Vertex AI Feature Store in real-time
B. Create a batch ingestion pipeline using Cloud Composer to periodically sync all sources to Feature Store
C. Use Dataflow for streaming Pub/Sub data to Feature Store, and create batch ingestion jobs for BigQuery and Cloud SQL using the Feature Store SDK
D. Export all data to Cloud Storage first, then use a single Dataflow pipeline to batch load into Feature Store

Explanation

Option C is correct because it follows best practices for hybrid ingestion patterns: streaming sources (Pub/Sub) should use streaming pipelines to maintain low latency, while batch sources (BigQuery, Cloud SQL) can use scheduled batch ingestion which is more cost-effective. Option A doesn't efficiently handle batch sources. Option B processes streaming data in batches, losing real-time benefits. Option D adds unnecessary intermediate storage and doesn't leverage streaming capabilities for real-time data.

Question 3 of 10

Your healthcare ML project needs to preprocess patient data containing PHI from Cloud SQL before training. The data must be de-identified to comply with HIPAA. Which combination of GCP services provides the MOST comprehensive de-identification solution?

A. Use Cloud SQL export to Cloud Storage, then apply custom de-identification scripts in Dataflow
B. Use Cloud DLP API to inspect and de-identify PHI, then store de-identified data in BigQuery for preprocessing
C. Export to BigQuery and use SQL HASH functions to anonymize sensitive columns
D. Use Vertex AI Workbench with custom Python scripts to mask PHI fields

Explanation

Cloud DLP (Data Loss Prevention) API (Option B) is the correct choice because: 1) It's specifically designed for detecting and de-identifying sensitive data including PHI, 2) Provides well over 100 built-in infoType detectors for PII/PHI, 3) Offers various de-identification methods (masking, tokenization, crypto-based hashing, date shifting), 4) Helps maintain HIPAA compliance with audit trails. Option A lacks comprehensive detection capabilities. Option C relies on plain hash functions, which are open to dictionary attacks on low-entropy fields and offer none of DLP's capabilities such as format-preserving encryption or k-anonymity risk analysis. Option D requires manual implementation and lacks the robustness of Cloud DLP.
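To make the de-identification methods named above more concrete, here is a local illustration of three of them: character masking, keyed hashing as a stand-in for tokenization, and date shifting. This is not the Cloud DLP API; the function names, the demo key, and the 30-day shift bound are all hypothetical choices for the sketch.

```python
import hashlib
import hmac
from datetime import date, timedelta


def mask(value, keep_last=4, mask_char="#"):
    """Character masking: hide everything except the last `keep_last` characters."""
    hidden = max(len(value) - keep_last, 0)
    return mask_char * hidden + value[hidden:]


def keyed_token(value, key):
    """Keyed hash (HMAC-SHA256): the same input and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def shift_date(day, patient_id, key, max_days=30):
    """Date shifting: move a date by a per-patient offset in [-max_days, max_days].

    The offset is derived from the patient id, so the intervals between one
    patient's events are preserved after shifting.
    """
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return day + timedelta(days=offset)
```

A managed service adds what this sketch lacks: detection of where the sensitive values are in the first place, plus key management.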

Question 4 of 10

You're preprocessing 10 PB of unstructured image data stored in Cloud Storage for a computer vision model. The preprocessing includes resizing, normalization, and augmentation. Training will use Vertex AI with TPUs. What is the MOST efficient data organization strategy?

A. Keep images as individual JPEG files organized in folders by class, use tf.data.Dataset with prefetch and parallel reads
B. Convert all images to TFRecord format with multiple examples per file, store in Cloud Storage, use tf.data.TFRecordDataset
C. Load all images into BigQuery as BYTES, use BigQuery ML for preprocessing
D. Store images in Cloud Filestore for low-latency access during training

Explanation

TFRecord format (Option B) is optimal for large-scale image training because: 1) Sequential reads are more efficient than random access to millions of small files, 2) Reduces per-object request and I/O overhead, 3) Better integration with TPUs and the tf.data pipeline, 4) Enables efficient distributed training with sharding. Option A suffers from the small-file problem, causing excessive metadata operations. Option C is not designed for binary image data at this scale. Option D (Filestore) is expensive and unnecessary when Cloud Storage with TFRecords provides sufficient throughput for TPU training.
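The sharding step mentioned above can be sketched without TensorFlow. The function below only plans which image goes into which shard file, using the common `prefix-00000-of-0000N` naming pattern; the actual writing would typically go through `tf.io.TFRecordWriter`, which is left out so the snippet runs anywhere. The bucket path and the function name are hypothetical.

```python
def plan_shards(paths, num_shards, prefix="train"):
    """Assign file paths to shards round-robin.

    Returns a dict mapping each shard file name to the list of paths it holds.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    names = [
        f"{prefix}-{i:05d}-of-{num_shards:05d}.tfrecord" for i in range(num_shards)
    ]
    plan = {name: [] for name in names}
    # Sorting first makes the assignment reproducible between runs.
    for index, path in enumerate(sorted(paths)):
        plan[names[index % num_shards]].append(path)
    return plan
```

Round-robin assignment keeps shard sizes within one example of each other, which helps when shards are spread evenly over training workers.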

Question 5 of 10

Your team needs to perform exploratory data analysis on 2 TB of JSON log files stored in Cloud Storage before building a text classification model. The analysis requires interactive querying and visualization. Which approach minimizes setup time and cost?

A. Create an external table in BigQuery pointing to the JSON files in Cloud Storage, query directly without loading
B. Set up a Dataproc cluster, load data into HDFS, and use PySpark notebooks
C. Load all JSON files into a Cloud SQL PostgreSQL instance and use SQL queries
D. Download data to a Vertex AI Workbench instance and use pandas for analysis

Explanation

BigQuery external tables (Option A) are ideal for this scenario because: 1) No data movement or loading time required, 2) No BigQuery storage cost, since the data stays in Cloud Storage, 3) Automatic JSON schema detection, 4) Serverless with pay-per-query pricing, 5) Results can be visualized directly in Looker Studio (formerly Data Studio). Option B requires cluster provisioning and management overhead. Option C is not designed for 2 TB of log data and would be expensive. Option D doesn't scale well and requires downloading large amounts of data.

Question 6 of 10

You're building an ML pipeline that ingests streaming sensor data from IoT devices via Pub/Sub, performs feature engineering with 5-minute tumbling windows, and needs features available for both batch training and online serving. What architecture BEST supports this requirement?

A. Pub/Sub → Dataflow → BigQuery for batch, and Pub/Sub → Cloud Functions → Vertex AI Feature Store for online
B. Pub/Sub → Dataflow → Vertex AI Feature Store (with both batch and online serving enabled)
C. Pub/Sub → Dataflow → Cloud Storage → scheduled batch job to Feature Store
D. Pub/Sub → Cloud Run → Cloud SQL → Vertex AI Pipelines for feature extraction

Explanation

Option B is correct because: 1) Single pipeline maintains consistency between batch and online features, 2) Dataflow efficiently handles windowing operations on streaming data, 3) Vertex AI Feature Store natively supports both serving modes from a single source, 4) Eliminates duplicate pipelines and training-serving skew. Option A creates separate pipelines increasing complexity and skew risk. Option C only supports batch serving. Option D uses Cloud Run which isn't optimal for stateful streaming windowing operations and Cloud SQL isn't designed for feature serving at scale.
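The 5-minute tumbling windows from the question can be illustrated in plain Python. In an Apache Beam pipeline on Dataflow this grouping would normally come from a fixed-window transform rather than hand-written code; the sketch below only shows the windowing arithmetic. The event shape `(device_id, unix_ts, value)` and the function name are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60


def tumbling_window_means(events, window_seconds=WINDOW_SECONDS):
    """Group (device_id, unix_ts, value) events into fixed windows and average them.

    Returns {(device_id, window_start): mean_value}.
    Each window covers [window_start, window_start + window_seconds).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running_sum, count]
    for device_id, ts, value in events:
        # Floor the timestamp to the start of its window.
        window_start = int(ts) // window_seconds * window_seconds
        bucket = sums[(device_id, window_start)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Because windows do not overlap, every event contributes to exactly one feature value, and computing that value in a single pipeline is what keeps the training and serving copies identical.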

Question 7 of 10

Your organization has customer data in Cloud Spanner (100 GB) that needs preprocessing for an ML model. The preprocessing includes complex joins with reference data in BigQuery (10 TB) and aggregations. What is the MOST efficient approach?

A. Export Cloud Spanner data to Cloud Storage, load into BigQuery, perform all preprocessing in BigQuery
B. Use Dataflow to read from both Cloud Spanner and BigQuery, perform joins in Dataflow, output to Cloud Storage
C. Use federated queries in BigQuery to join Spanner data with BigQuery tables directly
D. Replicate BigQuery data to Cloud Spanner and perform all operations in Spanner

Explanation

BigQuery federated queries (Option C) allow querying external data sources including Cloud Spanner without data movement: 1) Eliminates data duplication and export time, 2) Leverages BigQuery's processing power for large-scale aggregations on the 10 TB dataset, 3) Cost-effective because there is no separate export job or pipeline to run, although the federated read still places load on the Spanner instance, 4) Maintains data freshness. Option A involves unnecessary data movement and time. Option B requires managing Dataflow resources and is less efficient for large-scale joins compared to BigQuery. Option D is impractical and expensive: it means replicating 10 TB to Spanner, and Spanner isn't optimized for analytical workloads.

Question 8 of 10

You need to ingest and preprocess text documents (PDFs, Word files) stored in Cloud Storage for a document classification model using Vertex AI. The preprocessing must extract text, clean formatting, and tokenize. Which solution provides the BEST integration with Vertex AI training?

A. Use Document AI to extract text from documents, Cloud Functions to trigger processing, store results in BigQuery, create Vertex AI Dataset
B. Use Apache Tika on Dataproc for text extraction, save to Cloud Storage as text files, import into Vertex AI managed dataset
C. Create a custom container with PyPDF2 and python-docx, use Vertex AI Pipelines to orchestrate extraction and preprocessing, output to managed dataset
D. Use Cloud Vision API for OCR, Cloud Natural Language API for processing, manually upload to Vertex AI

Explanation

Option C (Vertex AI Pipelines with custom container) is best because: 1) End-to-end ML workflow orchestration within Vertex AI ecosystem, 2) Reproducible and versionable preprocessing pipeline, 3) Direct output to Vertex AI managed datasets, 4) Can include preprocessing as part of the training pipeline, 5) Supports custom logic for various document formats. Option A uses Document AI (adds cost) and has disjointed components. Option B requires Dataproc cluster management overhead. Option D uses expensive APIs unnecessarily (Vision API for text documents) and lacks automation.
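The "clean formatting, and tokenize" steps that would run inside such a custom container can be sketched with the standard library alone. Text extraction itself (for example with PyPDF2 or python-docx) is assumed to have already produced a raw string; the function names and the tokenization rule are illustrative choices, not a prescribed method.

```python
import re


def clean_text(raw):
    """Remove common extraction debris from a raw text string."""
    text = re.sub(r"-\s*\n\s*", "", raw)  # rejoin words hyphenated across line breaks
    text = re.sub(r"\s+", " ", text)      # collapse newlines, tabs, and repeated spaces
    return text.strip()


def tokenize(text):
    """Lowercase word tokenizer: keeps letters, digits, and inner apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())
```

Keeping these steps as small pure functions makes them easy to version and to reuse unchanged at serving time.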

Question 9 of 10

Your ML team is preprocessing a dataset in BigQuery that contains user email addresses, IP addresses, and transaction amounts. You need to preserve data utility for ML while protecting privacy. The model needs to learn patterns but not memorize specific users. What technique is MOST appropriate?

A. Use k-anonymity by generalizing email domains to company names and IP addresses to /24 subnets
B. Apply deterministic encryption to email and IP addresses using Cloud KMS
C. Use differential privacy by adding calibrated noise to transaction amounts and hashing emails/IPs
D. Remove email and IP columns entirely and only use transaction amounts

Explanation

Differential privacy (Option C) is most appropriate because: 1) Adds mathematical privacy guarantees that limit what can be inferred about any individual, 2) Maintains statistical properties needed for ML, 3) Hashing provides a consistent pseudonymous representation for categorical features (email/IP), though the formal guarantee comes from the noise rather than from the hashing, 4) Noise addition to numerical features prevents exact value memorization while preserving distributions. Option A (k-anonymity) may not provide sufficient protection against re-identification attacks. Option B (encryption) doesn't reduce privacy risk in ML: the model can still memorize the stable encrypted values. Option D loses potentially valuable features like email domain patterns or geographic information from IPs.
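A minimal sketch of the two ingredients in Option C, assuming a per-record Laplace mechanism with clipped amounts and a salted SHA-256 hash for identifiers. The clipping bounds, the epsilon value, the salt, and the function names are all assumptions chosen for illustration; a production setup would use a vetted differential privacy library.

```python
import hashlib
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_amount(amount, lower, upper, epsilon, rng):
    """Clip to [lower, upper], then add Laplace noise.

    The noise scale is sensitivity / epsilon, with sensitivity = upper - lower.
    """
    clipped = min(max(amount, lower), upper)
    return clipped + laplace_noise((upper - lower) / epsilon, rng)


def pseudonymize(value, salt):
    """Salted SHA-256 hash: a stable categorical key, not a formal privacy guarantee."""
    return hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()
```

Smaller epsilon values mean more noise and stronger privacy; the clipping bounds matter because they determine how much any single record can move the result.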

Question 10 of 10

You're managing multiple datasets in Vertex AI for different ML projects. The datasets include images (50 TB), tabular data (500 GB), and text (1 TB). You need to implement data versioning and lineage tracking while optimizing storage costs. What is the BEST strategy?

A. Store all raw data in Cloud Storage with versioning enabled, use Vertex AI managed datasets with metadata pointing to Cloud Storage versions, enable Data Catalog for lineage
B. Store everything in BigQuery with table snapshots for versioning, use Vertex AI datasets linked to BigQuery tables
C. Use Cloud Storage with custom version labels in filenames, track lineage manually in spreadsheets, create Vertex AI datasets for each version
D. Store data in Cloud Storage buckets organized by project, use Git for version tracking, manually register datasets in Vertex AI

Explanation

Option A is the best approach because: 1) Cloud Storage object versioning provides native version control; each noncurrent version is stored and billed as a full object rather than as a delta, so lifecycle rules should be used to expire or archive old versions and keep costs down, 2) Vertex AI managed datasets provide metadata management and integration with training/pipelines, 3) Data Catalog (now part of Dataplex) can capture lineage for supported GCP services, 4) Cost-effective for mixed data types especially large binary files (images), 5) Scalable and follows GCP best practices. Option B is expensive for large binary data (images) and BigQuery isn't designed for this. Option C lacks automation and proper lineage tracking tools. Option D uses Git inappropriately for large datasets and requires manual work prone to errors.

Exploring and preprocessing organization-wide data (e.g., Cloud Storage, BigQuery, Spanner, Cloud SQL, Apache Spark, Apache Hadoop)