The Last of the Bookies

John was a fresh-faced twentysomething bookie, not long out of college, when the cops kicked in the door of his uncle’s gambling den, located deep within the seventh floor of a Manhattan walk-up. The…

Smartphone

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转




Working with Nested and Repeated Fields in BigQuery

If you have worked with nested and repeated columns and found them a bit challenging, then this article is for you. In this blog, we explain the nitty-gritty of such kinds of fields in BigQuery. We will discuss in detail why they are preferred in data warehousing and how to create and query them. Let’s start by defining what nested and repeating fields are.

In relational databases, data are organized in a way that relationships between key data points can easily be identified. For example, if we want to analyze the purchasing behavior of a customer, we may derive information from three tables containing information about the customer, products, and order transactions. A single product in the products table can have multiple rows in the orders table, which makes it difficult to put the data in a single row. The process of organizing relational databases in a way that minimizes data redundancy and enhances data integrity is called normalization.

For data warehousing, data analysts and engineers often are interested in taking the denormalization route and combining several separate tables into one large reporting table without any data redundancy. Google’s BigQuery offers a way to handle these processes using nested and repeated fields.

Let’s have a look at an example of nested and repeated data. The data consists the following fields:

Here is what the JSON data file of the above example would look like.

{“id”:”1",”name”:”Xavier Williams”, “location”:[“city”:”Berlin”,”country”:”Germany”]}

{“id”:”1",”name”:”Crystal Wolf”, “location”:[“city”:”Munich”,”country”:”Germany”]}

Basically, the nested data are multiple fields within each location and the repeated data are the multiple locations.

Now let’s take a closer look at nested and repeating fields in BigQuery, and then see why this is such a popular design for data analysts and engineers. We’ll illustrate this with an e-commerce example running on Google Cloud. This company stores transactional orders in BigQuery in the “orders” table.

As a data analyst or engineer, how can you efficiently store this disparate information in your data warehouse? Note that the company needs to support a large user base that requests large amounts of data on a daily basis.

As shown in figure 1 below, one method is that we can store each fact in one place, as is common in relational database management systems. So what are the disadvantages of this method? When working with communications programs, our data from different locations need to be brought together for analysis. This requires lots of joining. Joining, especially in the case of very large tables, means a high analysis effort and is therefore expensive. This is because DBMS are record-based and they must fully open each record and extract the join key from the table it is joined to.

Normalized Schema

Another approach is that we can go the fully denormalized route and store all levels of granularity in one big table. Although it speeds up data reading, this option has several disadvantages. In the schema shown in Figure 2, information common to each order is repeated. That means we have to be very careful when analyzing data at different levels of granularity. We must be careful not to double or triple count when trying to calculate.

Fully Denormalized Schema

This is where nested and repeated data fields come in handy. We can have a row for each order and repeated values in a row for more detailed data. For example, the repeated and nested table below clearly shows that there are only three rows for three unique order IDs.

Denormalizing Data with Nested and Repeated Fields

If you are looking at a nested and repeated table, like the one in figure 3, for the first time, you might be wondering about column names like “orders.price” and ask yourself “What’s a period doing in those column names?” Dotted columns are called STRUCT or structured data types in SQL. We can think of them as pre-joined tables within a table. Instead of having a separate table for products, orders, and customers, we can just put all of them on a big table.

By using STRUCT we simplify the query since the data is conceptually pre-joined. The column-based nature of BigQuery makes having a huge table with repeating and nested fields the most efficient and cost-effective option. For example, if we want to know the total quantities ordered, BigQuery returns the desired result by accessing the “order.quantity” column without having to look at other columns. The ability to have lots of data in one place and be able to easily analyze it without being hampered by the problem of multi-granularity is a very desirable feature for analysts.

STRUCTs are containers in which multiple fields and data types can be nested. ARRAYs can be one of the field types within a STRUCT. Notice all the empty space between the rows, that’s because some columns are at a deeper level of granularity — meaning there are multiple duplicate values for those columns. An ARRAY is the perfect data type to handle that repeated value and store that data in a single row. In a schema, ARRAY data types are identified as repeating by observing their mode in the table schema.

One thing to note here is that STRUCT and ARRAY data types in SQL are independent of each other. A table may have an ARRAY column that has no relation to any structure. Also, there can be a STRUCT with no ARRAY fields. We want to use them both at the same time because we want to take advantage of their unique features. While an ARRAY allows us to go deep into different levels of granularity, STRUCT helps us to go wide by pre-joining related fields together.

The traditional relational database system is advantageous when we have tables smaller than 10 gigabytes. Therefore it is better to keep the tables normalized. The smaller the size, the less impact JOIN has on performance. However, as data grows in size, so does the performance impact of a JOIN. In this case, denormalizing our data would be the desirable option. As a rule of thumb, 10 gigabytes is taken as a threshold.

Let’s try the STRUCT syntax and create a table.

Since ‘id’ and ‘name’ are a subset of products, we have a dot notation.

Running the following query will create the repeated and nested table shown in figure 3.

Querying nested and repeated fields is a little bit tricky. Let’s try to query the name of a product purchased by a customer with an ID 101.

The above query will throw the following error:

“Cannot access field id on a value with type ARRAY<STRUCT<id STRING, name STRING, email STRING, …>> at [4:17]”.

Since repeated fields (arrays) are stored in a single row, we need to first break them back into rows so that we can query them normally. To extract information from a repeated field in BigQuery, we must use a more exotic pattern. This is normally done using the UNNEST function, which converts an array of values from a table into rows. This unnest function helps to flatten the data, making accessing nested data easier.

Now, let’s try to solve the previous error by using UNNEST.

Similarly, let’s extract order.price and order.quantity from the orders field

Now that you are familiar with the nested data and unnest function. Let’s calculate the most product sold from our database

That’s it! Now it’s your turn to use whatever you have learned to query complex data.

Here are a few open-source data for you to practice in BigQuery.

This article introduced you to the steps required for BigQuery Nested Fields & Repeated Fields. Furthermore, you learned about what and how to create and query BigQuery Nested & Repeated Fields.

With your Data Warehouse, Google BigQuery live and running, you’ll need to extract data from multiple platforms to carry out your analysis. However, integrating and analyzing your data from a diverse set of data sources can be challenging and this is where datadice comes into the picture.

If you are looking for help to set up a modern and cost-efficient data warehouse or analytical dashboards, send us an email at hello@datadice.io and we will schedule a call.

Add a comment

Related posts:

LIVING ROOM

This venue was completely reimagined to create a timeless space that builds appeal and interest through layered decor elements. The final result exceeded the clients’ expectations and has become a…

Touring the Different Types of AI

Artificial Intelligence is a fascinating and very broad field. In this article we’ll talk about the various types of AI and what you can do with each one. Specifically we’ll cover content generation…

Story of a Kissless Encounter

On a day before today.. Early june or late in Mai.. I was walking on the way.. Where the angels used to stay.. She was there walking with me.. Many smiles in her pocket, and a poem in her wallet, and…