Exploring and preprocessing organization-wide data (e.g., Cloud Storage, BigQuery, Spanner, Cloud SQL, Apache Spark, Apache Hadoop)

Collaborating within and across teams to manage data and models

10 Questions
Question 1 of 10
Your organization stores 500 TB of structured transaction data in BigQuery that needs to be preprocessed for a fraud detection model. The preprocessing involves complex windowing operations calculating rolling statistics over 90-day periods. Which approach would be MOST cost-effective and performant?
Explanation
BigQuery SQL with window functions is the most cost-effective solution because: 1) Data stays in BigQuery, eliminating expensive data movement, 2) BigQuery's columnar storage and distributed architecture are optimized for analytical queries with window functions, 3) No additional compute resources are needed. Option A incurs unnecessary data egress costs and Dataflow compute costs. Option C requires a Dataproc cluster, which is more expensive for this use case. Option D doesn't scale well to 500 TB and would require complex chunking logic with potential memory issues.
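To make the windowing concrete, here is a minimal pure-Python sketch of what a trailing 90-day window computes for one account. It is only a reference for the semantics, not how BigQuery executes the query; in BigQuery SQL the same idea would typically be written as a window function with a frame such as `RANGE BETWEEN 89 PRECEDING AND CURRENT ROW` over a numeric day value. The function name, the row layout, and the sample data are assumptions made for illustration.

```python
from collections import deque
from datetime import date, timedelta


def rolling_90d_stats(rows, window_days=90):
    """rows: (day, amount) pairs for one account, sorted by day.

    Yields (day, rolling_sum, rolling_mean) over the trailing window,
    which includes the current day and the 89 days before it.
    """
    window = deque()
    total = 0.0
    for day, amount in rows:
        window.append((day, amount))
        total += amount
        # Evict rows older than the start of the trailing window.
        cutoff = day - timedelta(days=window_days - 1)
        while window[0][0] < cutoff:
            _, old_amount = window.popleft()
            total -= old_amount
        yield day, total, total / len(window)


# Hypothetical sample transactions for a single account.
rows = [
    (date(2024, 1, 1), 100.0),
    (date(2024, 2, 1), 50.0),
    (date(2024, 4, 15), 30.0),
]
results = list(rolling_90d_stats(rows))
```

By the third row, the January 1 transaction has fallen out of the 90-day window, so only the last two amounts contribute to the sum and mean.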
Question 2 of 10
You need to create a unified feature store for multiple ML teams working on customer segmentation models. Source data includes real-time clickstream events (Cloud Pub/Sub), historical purchase data (BigQuery), and customer profiles (Cloud SQL). What is the BEST approach to consolidate these features in Vertex AI Feature Store?
Explanation
Option C is correct because it follows best practices for hybrid ingestion patterns: streaming sources (Pub/Sub) should use streaming pipelines to maintain low latency, while batch sources (BigQuery, Cloud SQL) can use scheduled batch ingestion which is more cost-effective. Option A doesn't efficiently handle batch sources. Option B processes streaming data in batches, losing real-time benefits. Option D adds unnecessary intermediate storage and doesn't leverage streaming capabilities for real-time data.
Question 3 of 10
Your healthcare ML project needs to preprocess patient data containing PHI from Cloud SQL before training. The data must be de-identified to comply with HIPAA. Which combination of GCP services provides the MOST comprehensive de-identification solution?
Explanation
Cloud DLP (Data Loss Prevention) API, now part of Sensitive Data Protection, is the correct choice because: 1) It's specifically designed for detecting and de-identifying sensitive data including PHI, 2) Provides well over 100 built-in infoType detectors for PII/PHI, 3) Offers various de-identification methods (masking, tokenization, crypto-based hashing, date shifting), 4) Helps maintain HIPAA compliance with audit trails. Option A lacks comprehensive detection capabilities. Option C relies on simple HASH functions and doesn't provide proper de-identification techniques such as format-preserving encryption or k-anonymity analysis. Option D requires manual implementation and lacks the robustness of Cloud DLP.
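The de-identification methods listed above can be illustrated with a small standard-library sketch. This is not the Cloud DLP API and it does no detection; it only shows conceptually what character masking, keyed hashing, and date shifting do to a value. The key, the function names, and the sample values are all hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical demo key; a real key would live in a key management service.
SECRET_KEY = b"demo-key-not-for-production"


def mask(value, keep_last=4, mask_char="*"):
    """Character masking: hide all but the last `keep_last` characters."""
    hidden = max(len(value) - keep_last, 0)
    return mask_char * hidden + value[hidden:]


def pseudonymize(value, key=SECRET_KEY):
    """Keyed hashing (HMAC-SHA256): same input and key give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def shift_date(d, patient_id, max_days=30, key=SECRET_KEY):
    """Date shifting: every date for one patient moves by the same offset."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)
```

Because the offset is derived from the patient identifier, intervals between one patient's events are preserved, which is usually what a model needs, while the absolute dates are obscured.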
Question 4 of 10
You're preprocessing 10 PB of unstructured image data stored in Cloud Storage for a computer vision model. The preprocessing includes resizing, normalization, and augmentation. Training will use Vertex AI with TPUs. What is the MOST efficient data organization strategy?
Explanation
TFRecord format (Option B) is optimal for large-scale image training because: 1) Sequential reads are more efficient than random access to millions of small files, 2) Reduces storage and I/O overhead, 3) Better integration with TPUs and the tf.data pipeline, 4) Enables efficient distributed training with sharding. Option A suffers from the small-file problem, causing excessive metadata operations. Option C is not designed for binary image data at this scale. Option D (Filestore) is expensive and unnecessary when Cloud Storage with TFRecords provides sufficient throughput for TPU training.
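The sharding point can be sketched without TensorFlow: the snippet below only plans which shard file each image would go into, using a stable hash of the path. The file-name pattern, the helper names, and the shard count are assumptions for illustration; actually serializing the images would be done separately, for example with `tf.io.TFRecordWriter`.

```python
import zlib


def shard_name(index, num_shards, prefix="train"):
    """Conventional shard file name, e.g. train-00003-of-00016.tfrecord."""
    return f"{prefix}-{index:05d}-of-{num_shards:05d}.tfrecord"


def assign_shard(image_path, num_shards):
    """Stable shard index from a CRC32 of the path (same result on every run)."""
    return zlib.crc32(image_path.encode("utf-8")) % num_shards


def plan_shards(image_paths, num_shards):
    """Group image paths by the shard file they would be written to."""
    plan = {shard_name(i, num_shards): [] for i in range(num_shards)}
    for path in image_paths:
        index = assign_shard(path, num_shards)
        plan[shard_name(index, num_shards)].append(path)
    return plan


# Hypothetical object paths.
paths = [f"gs://example-bucket/images/img_{i}.jpg" for i in range(100)]
plan = plan_shards(paths, num_shards=4)
```

Hashing the path rather than assigning shards by position keeps the assignment stable when new images are added later, at the cost of slightly uneven shard sizes.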
Question 5 of 10
Your team needs to perform exploratory data analysis on 2 TB of JSON log files stored in Cloud Storage before building a text classification model. The analysis requires interactive querying and visualization. Which approach minimizes setup time and cost?
Explanation
BigQuery external tables (Option A) are ideal for this scenario because: 1) No data movement or loading time required, 2) No additional BigQuery storage cost, since the data stays in Cloud Storage, 3) Automatic JSON schema detection, 4) Serverless with pay-per-query pricing, 5) Results can be visualized directly in Looker Studio (formerly Data Studio). Option B requires cluster provisioning and management overhead. Option C is not designed for 2 TB of log data and would be expensive. Option D doesn't scale well and requires downloading large amounts of data.
Question 6 of 10
You're building an ML pipeline that ingests streaming sensor data from IoT devices via Pub/Sub, performs feature engineering with 5-minute tumbling windows, and needs features available for both batch training and online serving. What architecture BEST supports this requirement?
Explanation
Option B is correct because: 1) Single pipeline maintains consistency between batch and online features, 2) Dataflow efficiently handles windowing operations on streaming data, 3) Vertex AI Feature Store natively supports both serving modes from a single source, 4) Eliminates duplicate pipelines and training-serving skew. Option A creates separate pipelines increasing complexity and skew risk. Option C only supports batch serving. Option D uses Cloud Run which isn't optimal for stateful streaming windowing operations and Cloud SQL isn't designed for feature serving at scale.
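As a rough illustration of the 5-minute tumbling windows mentioned in the question, the sketch below buckets events by flooring their timestamps to a window boundary and computes a per-sensor mean. It deliberately ignores late data, watermarks, and triggers, which a streaming engine such as Dataflow handles; the event layout and names are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60


def window_start(ts, size=WINDOW_SECONDS):
    """Floor an epoch timestamp to the start of its tumbling window."""
    return ts - (ts % size)


def tumbling_mean(events, size=WINDOW_SECONDS):
    """events: iterable of (sensor_id, epoch_seconds, value).

    Returns {(sensor_id, window_start): mean value in that window}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for sensor_id, ts, value in events:
        key = (sensor_id, window_start(ts, size))
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical readings: two in the first window for s1, one in the next.
events = [("s1", 0, 10.0), ("s1", 299, 20.0), ("s1", 300, 40.0), ("s2", 10, 5.0)]
features = tumbling_mean(events)
```

Each event belongs to exactly one window, which is what distinguishes tumbling windows from sliding ones. Computing the feature once, in one place, is what lets the same values feed both training and serving.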
Question 7 of 10
Your organization has customer data in Cloud Spanner (100 GB) that needs preprocessing for an ML model. The preprocessing includes complex joins with reference data in BigQuery (10 TB) and aggregations. What is the MOST efficient approach?
Explanation
BigQuery federated queries (Option C) allow querying external data sources, including Cloud Spanner, without data movement: 1) Eliminates data duplication and export time, 2) Leverages BigQuery's processing power for large-scale aggregations on the 10 TB dataset, 3) Cost-effective because no separate export or ETL pipeline is needed (the federated query still consumes Spanner compute for the 100 GB read), 4) Maintains data freshness. Option A involves unnecessary data movement and time. Option B requires managing Dataflow resources and is less efficient for large-scale joins compared to BigQuery. Option D is impractical and expensive: it replicates 10 TB into Spanner, which isn't optimized for analytical workloads.
Question 8 of 10
You need to ingest and preprocess text documents (PDFs, Word files) stored in Cloud Storage for a document classification model using Vertex AI. The preprocessing must extract text, clean formatting, and tokenize. Which solution provides the BEST integration with Vertex AI training?
Explanation
Option C (Vertex AI Pipelines with custom container) is best because: 1) End-to-end ML workflow orchestration within Vertex AI ecosystem, 2) Reproducible and versionable preprocessing pipeline, 3) Direct output to Vertex AI managed datasets, 4) Can include preprocessing as part of the training pipeline, 5) Supports custom logic for various document formats. Option A uses Document AI (adds cost) and has disjointed components. Option B requires Dataproc cluster management overhead. Option D uses expensive APIs unnecessarily (Vision API for text documents) and lacks automation.
Question 9 of 10
Your ML team is preprocessing a dataset in BigQuery that contains user email addresses, IP addresses, and transaction amounts. You need to preserve data utility for ML while protecting privacy. The model needs to learn patterns but not memorize specific users. What technique is MOST appropriate?
Explanation
Differential privacy (Option C) is most appropriate because: 1) Adds mathematical privacy guarantees that limit what can be inferred about any individual, 2) Maintains statistical properties needed for ML, 3) The hashing step in this option provides a consistent pseudonymous representation for categorical features (email/IP), 4) Noise addition to numerical features prevents exact value memorization while preserving distributions. Option A (k-anonymity) may not provide sufficient protection against re-identification attacks. Option B (encryption) doesn't reduce privacy risk in ML, because the model can still learn the encrypted values as stable per-user identifiers. Option D loses potentially valuable features like email domain patterns or geographic information from IPs.
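The two ingredients named in this explanation, hashing identifiers and adding noise to numeric values, can be sketched with the standard library. This is a simplified illustration of the Laplace mechanism, not a production differential privacy implementation: salted hashing on its own is pseudonymization rather than differential privacy, and the noise guarantee only holds if amounts are first clipped to a bounded range so that the sensitivity is known. The salt, the parameter values, and the function names are assumptions.

```python
import hashlib
import random


def hash_identifier(value, salt):
    """Salted SHA-256: a consistent pseudonymous token for emails or IPs."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def laplace_noise(scale, rng):
    """Laplace(0, scale) sampled as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_amount(amount, sensitivity, epsilon, rng):
    """Laplace mechanism: noise scale is sensitivity / epsilon."""
    return amount + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)  # seeded only so the example is repeatable
token = hash_identifier("user@example.com", salt="hypothetical-salt")
noisy = privatize_amount(120.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means a larger noise scale and stronger privacy, at the cost of noisier training data.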
Question 10 of 10
You're managing multiple datasets in Vertex AI for different ML projects. The datasets include images (50 TB), tabular data (500 GB), and text (1 TB). You need to implement data versioning and lineage tracking while optimizing storage costs. What is the BEST strategy?
Explanation
Option A is the best approach because: 1) Cloud Storage object versioning provides native version control without manually maintained dataset copies (each noncurrent version is stored and billed as a full object rather than a delta, so lifecycle rules are needed to keep costs down), 2) Vertex AI managed datasets provide metadata management and integration with training/pipelines, 3) Data Catalog (now part of Dataplex) can capture lineage for supported GCP services, 4) Cost-effective for mixed data types, especially large binary files (images), 5) Scalable and follows GCP best practices. Option B is expensive for large binary data (images), which BigQuery isn't designed to store. Option C lacks automation and proper lineage tracking tools. Option D uses Git inappropriately for large datasets and requires error-prone manual work.