How To List All BigQuery Datasets and Tables with Python
BigQuery is the managed data warehousing service on Google Cloud Platform that lets users store, manage and query data. A big part of Data and/or Analytics Engineering is the automation of certain tasks, including some interactions with BigQuery. Such automations usually require us to make use of the BigQuery API, which enables developers to interact programmatically with the service on GCP.
In today's article we will demonstrate how to use the BigQuery API and the Google-provided Python client in order to programmatically fetch all tables and datasets under a single GCP project.
Prerequisites
In order to follow along with the steps outlined in this tutorial, you'll need the Google-provided Python client library. To get it, install google-cloud-bigquery via pip:
$ python3 -m pip install --upgrade google-cloud-bigquery
Additionally, you'll have to authenticate the client via Application Default Credentials (ADC). To do so, first make sure that you have already installed the gcloud Command Line Interface (CLI). Since we are developing our code on our local machines, all we need to do is authenticate via ADC, as outlined below:
$ gcloud auth application-default login
After running the command, a new tab will open in your browser, asking you to log in with your Google account. Note that you should make sure your personal Google Cloud account has sufficient permissions to perform the operations we will be running in the next few sections.
Listing all BigQuery datasets
First, let's start by creating an instance of a BigQuery client:
from google.cloud import bigquery
# The client we'll use to interact with BigQuery
client = bigquery.Client(project='my-gcp-project')
We can now use the client to fetch all BigQuery datasets by calling the list_datasets() method, which returns an iterator over objects of type [DatasetListItem](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.dataset.DatasetListItem).
# Fetch all datasets as a list of `DatasetListItem` objects
bq_datasets = list(client.list_datasets())
Now we can iterate over the bq_datasets object and print the BigQuery datasets for the project we used when instantiating the client.
# Print BigQuery datasets in project `my-gcp-project`
for dataset in bq_datasets:
    print(f'{client.project=}, {dataset.dataset_id=}')
Here's a complete version of the code, combining all the code snippets we discussed earlier.
"""
Script used to iterate over BigQuery datasets in a single
BigQuery project.
"""
from google.cloud import bigquery
# The client we'll use to interact with BigQuery
client = bigquery.Client(project='my-gcp-project')
# Get an iterator pointing to `DaasetListItem` objects
bq_datasets = client.list_datasets()
# Print BigQuery datasets in project `my-gcp-project`
for dataset in bq_datasets:
print(f'{client.project=}, {dataset.dataset_id=}')
Lastly, this is a sample output generated by our script:
client.project='my-gcp-project', dataset.dataset_id='my_dataset'
client.project='my-gcp-project', dataset.dataset_id='another_dataset'
client.project='my-gcp-project', dataset.dataset_id='oh_heres_another_one'
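If you need fully-qualified dataset IDs of the form `project.dataset` (handy when building queries later on), you can format them from the client's project and each `dataset_id`. Here's a minimal sketch of that idea; the helper name `full_dataset_id` is purely illustrative and not part of the BigQuery API:

```python
def full_dataset_id(project: str, dataset_id: str) -> str:
    """Build a fully-qualified BigQuery dataset ID (`project.dataset`)."""
    return f'{project}.{dataset_id}'

# With an authenticated client, you could then build the full list as:
# full_ids = [full_dataset_id(client.project, d.dataset_id)
#             for d in client.list_datasets()]
print(full_dataset_id('my-gcp-project', 'my_dataset'))
```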
Listing all BigQuery tables
Similarly, in order to list all the tables in the BigQuery project of interest, we will once again iterate through the datasets, and then list all the tables included in each one.
To do so, we'll need to call client.list_tables() for every dataset, which returns an iterator over objects of type [TableListItem](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.table.TableListItem).
for table in client.list_tables(dataset.dataset_id):
    print(f'{table.table_id=}, {dataset.dataset_id=}, {client.project=}')
Here's the full code, consisting of all the steps required to print every table name in a BigQuery project to standard output.
"""
Script used to iterate over BigQuery table names for every single
dataset in a particular BigQuery project.
"""
from google.cloud import bigquery
# The client we'll use to interact with BigQuery
client = bigquery.Client(project='my-gcp-project')
# Get an iterator pointing to `DaasetListItem` objects
bq_datasets = client.list_datasets()
# Print BigQuery table and dataset names in project `my-gcp-project`
for dataset in bq_datasets:
for table in client.list_tables(dataset.dataset_id):
print(f'{table.table_id=}, {dataset.dataset_id=}, {client.project=}')
Output:
table.table_id='my_table', dataset.dataset_id='my_dataset', client.project='my-gcp-project'
table.table_id='another_table', dataset.dataset_id='my_dataset', client.project='my-gcp-project'
table.table_id='temp_table', dataset.dataset_id='temp_ds', client.project='my-gcp-project'
table.table_id='temp_table_2', dataset.dataset_id='temp_ds', client.project='my-gcp-project'
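Instead of printing, you may want to collect the results for further processing. The grouping logic itself is plain Python, so it can be sketched without touching the API; `group_tables_by_dataset` below is a hypothetical helper, and in practice the `(dataset_id, table_id)` pairs would come from the nested loops above:

```python
from collections import defaultdict

def group_tables_by_dataset(pairs):
    """Group (dataset_id, table_id) pairs into a {dataset_id: [table_ids]} dict."""
    grouped = defaultdict(list)
    for dataset_id, table_id in pairs:
        grouped[dataset_id].append(table_id)
    return dict(grouped)

# With a client, the pairs could be built as:
# pairs = [(d.dataset_id, t.table_id)
#          for d in client.list_datasets()
#          for t in client.list_tables(d.dataset_id)]
sample = [('my_dataset', 'my_table'), ('my_dataset', 'another_table'),
          ('temp_ds', 'temp_table')]
print(group_tables_by_dataset(sample))
```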
Counting number of datasets and tables in BigQuery
Now that you know how to use the Python client to interact with BigQuery, you can also count the number of datasets in a single project.
"""
Script used to count number of datasets for a BigQuery project
"""
from google.cloud import bigquery
# The client we'll use to interact with BigQuery
client = bigquery.Client(project='my-gcp-project')
# Turn iterator into list, and count its length
dataset_count = len(list(client.list_datasets()))
print(f'Number of datasets in project {client.project}: {dataset_count}')
Similarly, we can even count the number of tables per dataset, or per project:
"""
Script used to count number of tables per dataset for a BigQuery project
"""
from google.cloud import bigquery
# The client we'll use to interact with BigQuery
client = bigquery.Client(project='my-gcp-project')
# Turn iterator into list, and count its length
table_total_count = 0
for dataset in client.list_datasets():
table_count = len(list(client.list_tables(dataset.dataset_id)))
table_total_count += table_count
print(f'No. of tables for dataset {dataset.dataset_id}: {table_count}')
print(f'Number of tables in project {client.project}: {table_total_count}')
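The running-total pattern above can also be condensed into a single `sum()` expression. In the sketch below the client calls are stubbed with a plain dictionary so the logic runs without GCP credentials; against BigQuery you'd pass `client.list_datasets` and `client.list_tables` instead of the stubs:

```python
def count_tables(list_datasets, list_tables):
    """Total table count: the table counts of all datasets, summed.

    `list_datasets` is a zero-argument callable yielding dataset IDs;
    `list_tables` maps a dataset ID to an iterable of its tables.
    """
    return sum(len(list(list_tables(ds))) for ds in list_datasets())

# Stubbed example data standing in for a real BigQuery project
stub = {'my_dataset': ['my_table', 'another_table'], 'temp_ds': ['temp_table']}
print(count_tables(stub.keys, stub.__getitem__))  # 3
```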
Final Thoughts
In this quick tutorial, we outlined the steps required to make use of the BigQuery API via the Google-provided Python client, covering both installation and authentication.
Furthermore, we demonstrated how to use the BigQuery Python client to programmatically list the datasets and tables in a BigQuery GCP project. Lastly, we also created a couple of scripts you can use to count the number of datasets or tables.
I hope you found this article useful. If you have trouble running the code shared in this tutorial, let me know in the comments and I will do my best to help you debug and run it successfully.