Parquet - Introduction

Published September 20, 2025 00:00

Parquet is a binary, columnar file format designed for efficient data storage and querying.

There are tons of articles about Parquet out there, so why write another one?
This is my take on this fantastic format, essentially the result of my experience working on a pure PHP implementation of Parquet.

For those who stumbled here by accident, I should mention I'm the author of the first data processing framework for PHP called Flow PHP.
As any proper DataFrame framework should, Flow needs to read and write data in various formats, including Parquet.

However, since the only implementation I found was basically a direct port from C#, which doesn't handle deeply nested structures particularly well and has quite a few missing features, I decided to write my own implementation from scratch as a learning exercise. This turned out to be an incredibly valuable experience and great fun.

Why Parquet

Binary Format

Thanks to being column-oriented rather than row-oriented, this format enables highly efficient data compression, which translates to significantly smaller file sizes. Without much effort, Parquet can compress data by up to 10 times compared to traditional formats like CSV or XML.

So if the same data stored in CSV format takes up 1GB, in Parquet format it might only take 100MB.

For this post, I generated 2 files - one in CSV format, another in Parquet.
The structure of these files is very simple, containing 10 columns and 10 million rows that look something like this:

index,order_id,created_at,updated_at,discount,email,customer,address,notes,items
0,254d61c5-22c8-4407-83a2-76f1cab53af2,2025-01-01T12:00:00+00:00,2025-01-01T12:10:00+00:00,24.4,[email protected],"John Doe 0","{""street"":""123 Main St, Apt 0"",""city"":""City "",""zip"":""12345-0"",""country"":""PL""}","[""Note 1 for order 0"",""Note 2 for order 0"",""Note 3 for order 0""]","[{""sku"":""SKU_0001"",""quantity"":1,""price"":0.14},{""sku"":""SKU_0002"",""quantity"":2,""price"":25.13}]"
1,254d61c5-22c8-4407-83a2-76f1cab53af2,2025-01-01T12:00:00+00:00,2025-01-01T12:10:00+00:00,24.4,[email protected],"John Doe 1","{""street"":""123 Main St, Apt 1"",""city"":""City "",""zip"":""12345-1"",""country"":""PL""}","[""Note 1 for order 1"",""Note 2 for order 1"",""Note 3 for order 1""]","[{""sku"":""SKU_0001"",""quantity"":1,""price"":0.14},{""sku"":""SKU_0002"",""quantity"":2,""price"":25.13}]"
2,254d61c5-22c8-4407-83a2-76f1cab53af2,2025-01-01T12:00:00+00:00,,,[email protected],"John Doe 2","{""street"":""123 Main St, Apt 2"",""city"":""City "",""zip"":""12345-2"",""country"":""PL""}","[""Note 1 for order 2"",""Note 2 for order 2"",""Note 3 for order 2""]","[{""sku"":""SKU_0001"",""quantity"":1,""price"":0.14},{""sku"":""SKU_0002"",""quantity"":2,""price"":25.13}]"
3,254d61c5-22c8-4407-83a2-76f1cab53af2,2025-01-01T12:00:00+00:00,,24.4,[email protected],"John Doe 3","{""street"":""123 Main St, Apt 3"",""city"":""City "",""zip"":""12345-3"",""country"":""PL""}","[""Note 1 for order 3"",""Note 2 for order 3"",""Note 3 for order 3""]","[{""sku"":""SKU_0001"",""quantity"":1,""price"":0.14},{""sku"":""SKU_0002"",""quantity"":2,""price"":25.13}]"

The compression effect is truly impressive:

4.1G Sep 20 18:32 orders.csv
437M Sep 20 18:47 orders.parquet

This translates not only to storage costs but also data processing costs.
Especially when your data lives in the cloud, whether in Azure Blob Storage or AWS S3. One of the biggest factors affecting your bill isn't actually the size of the data, but how much transfer you use to read/write that data.

So by reducing file size, we reduce not only the cost of storing it but also processing it. It's important to understand that processing really means any form of access - read/write operations.

The bottom line is that by choosing the right file format, the savings can be really substantial, especially when we're talking about larger amounts of data.

What does it actually mean for Parquet to be a binary format?

It means roughly that data is stored in binary form - that is, in a form that can't be directly read using popular text editors.

But everything is ultimately stored in binary form, isn't it?

Yes, generally text files are also binary files. The difference is that in text files the structure is always the same and each piece of information is written the same way.

For example, if we wanted to save "12345" in a text file, the binary version would look like this:

STRING: "12345"
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

Character:   '1'       '2'      '3'      '4'     '5'      '\0'
ASCII:       49        50       51        52     53        0
Binary:    00110001 00110010 00110011 00110100 00110101 00000000
           โ””โ”€byteโ”€โ”˜ โ””โ”€byteโ”€โ”˜ โ””โ”€byteโ”€โ”˜ โ””โ”€byteโ”€โ”˜ โ””โ”€byteโ”€โ”˜ โ””โ”€byteโ”€โ”˜

Total: 6 bytes (including null terminator)

The same string saved in binary format as int32 (integer in 32-bit form) would look like this:

INTEGER: 12345
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Byte (dec):    0        0        48       57
Binary:     00000000 00000000 00110000 00111001
            └─byte─┘ └─byte─┘ └─byte─┘ └─byte─┘

Total: 4 bytes for a 32-bit integer (big-endian)

Notice that in the case of an integer saved in binary form, you can't just read it from left to right (or vice versa) as characters. Here we need to know how to interpret these bytes to understand what they mean. With text files we don't have this problem, since each (ASCII) character is stored on exactly one byte.
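To see this in practice, here's a minimal, standalone PHP sketch (not tied to any Parquet library) comparing both representations with pack()/unpack():

<?php

// Compare the two representations of 12345 from the diagrams above.

$asText = "12345";            // 5 bytes of ASCII characters (no null terminator in PHP)
$asInt32 = pack('N', 12345);  // 4 bytes, big-endian int32: 00 00 30 39

echo strlen($asText), PHP_EOL;          // 5
echo strlen($asInt32), PHP_EOL;         // 4
echo bin2hex($asInt32), PHP_EOL;        // 00003039

// Reading the binary form back requires knowing how to interpret the bytes:
echo unpack('N', $asInt32)[1], PHP_EOL; // 12345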

This is roughly why any text editor can open any text file and show us something that makes more or less sense.

However, if we try to open a Parquet file in a text editor, we'll get a string of characters that looks very random and doesn't make much sense.

Columnar vs Row-based

The best way to explain the difference between these formats is through visualization.

In the classic row-based model, each row contains all columns, just like in CSV:

+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
|  A1  |  B1  |  C1  |
|  A2  |  B2  |  C2  |
|  A3  |  B3  |  C3  |
+------+------+------+

The columnar format is interesting in that instead of storing data row by row, it stores column by column.

+------+------+------+------+
| Col1 |  A1  |  A2  |  A3  |
+------+------+------+------+
| Col2 |  B1  |  B2  |  B3  |
+------+------+------+------+
| Col3 |  C1  |  C2  |  C3  |
+------+------+------+------+

Storing data in columnar format brings many benefits.

With a row-based format, to read just one column, we still have to go through the entire file.
With a columnar format, we can read only the columns that interest us.
This is particularly useful with very large datasets where we often need only part of the information.

Immutable

Due to how data is stored in columnar format, Parquet files are immutable.
This doesn't mean they can't be modified. They can, but the only sensible operation is appending data at the end.

Why? Parquet stores data in columnar format, meaning that if we have an `email` column, all rows (in a given row group and page - more on this later) are written one after another.
Trying to modify one row is therefore impossible, as it would require shifting practically the entire file.

However, it's possible to add a new row group at the end of the file. This is done by removing the metadata from the end of the file and keeping it temporarily in memory. In its place, a new row group is written (which also needs to be added to the metadata), and then the metadata is written again at the very end.

For this reason, if we want to remove something from a Parquet file, in practice we need to rewrite the entire file, excluding the unwanted data.

Strong Structure

Parquet is a format based on strong typing. This means the structure of the entire file is defined and stored in the footer, so you only need to read the appropriate segment to understand what data we have in the file and in which regions of the file that data is written.

We can think of this as a file map - a map that will tell us exactly where in the file the data we're interested in is located.

Here's roughly what a simplified Parquet file structure looks like:

+-----------------+
| PAR1            |
+-----------------+
| Data            |
| .............   |
| .............   |
+-----------------+
| File Metadata   |
+-----------------+
| PAR1            |
+-----------------+

In the example above we see 3 elements: the PAR1 magic number (written at both the beginning and the end of the file), the data itself, and the file metadata.

The first step to properly reading a Parquet file is checking if the first 4 bytes are PAR1.
If so, we need to jump to the end of the file (seek) and read the last 4 bytes.

If the end and beginning of the file contain PAR1, we can proceed to read the metadata.

To do this, we go back 8 bytes from the end of the file and read 4 bytes representing the metadata size. In other words, we read bytes -8 to -4 (counting from the end of the file).

These 4 bytes are a little-endian integer telling us how many bytes the metadata occupies. Having this information, we can read the metadata, which is serialized in binary using Apache Thrift.
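As a rough illustration, here's a minimal sketch of that magic-byte check and footer-length read in plain PHP, using nothing but the standard file functions (error handling omitted, file name is just an example):

<?php

$path = 'orders.parquet'; // example path
$size = filesize($path);
$fh = fopen($path, 'rb');

// The first 4 bytes must be "PAR1"...
$startMagic = fread($fh, 4);

// ...and so must the last 4 bytes.
fseek($fh, $size - 4);
$endMagic = fread($fh, 4);

if ($startMagic !== 'PAR1' || $endMagic !== 'PAR1') {
    throw new RuntimeException('Not a Parquet file');
}

// Bytes -8..-4 hold the metadata length as a little-endian 32-bit integer.
fseek($fh, $size - 8);
$metadataLength = unpack('V', fread($fh, 4))[1];

// The metadata itself sits directly before the length field.
fseek($fh, $size - 8 - $metadataLength);
$metadataBytes = fread($fh, $metadataLength);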

Apache Thrift

Apache Thrift is a very clever tool that allows binary serialization of interfaces/types in practically any programming language.

Here we can see what the Parquet metadata definition looks like.

This format somewhat resembles pseudocode, which is then used by the appropriate application to generate code in a given programming language.

Here we can see what the generated code looks like in PHP.

Once we have the generated structures/interfaces/models, we can proceed to reading.

<?php

use Flow\Parquet\Thrift\FileMetaData;
use Thrift\Protocol\TCompactProtocol;
use Thrift\Transport\TMemoryBuffer;

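// The footer layout is: [metadata][4-byte metadata length][PAR1],
// so the length field starts 8 bytes before the end of the file.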
$metadataLength = \unpack($this->byteOrder->value, $this->stream->read(4, $fileTotalSize - 8))[1];

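// Deserialize the Thrift compact-encoded FileMetaData that sits
// directly before the length field.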
$fileMetadata = new FileMetaData();
$fileMetadata->read(
    new TCompactProtocol(
        new TMemoryBuffer(
            $this->stream->read($metadataLength, $fileTotalSize - ($metadataLength + 8))
        )
    )
);

For this we'll need the Thrift library for our chosen programming language. All implementations are available in the apache/thrift repository.

Having access to $fileMetadata, we can start analyzing our file to understand its structure.

Parquet - FileMetaData

struct FileMetaData {
  1: required i32 version
  2: required list<SchemaElement> schema;
  3: required i64 num_rows
  4: required list<RowGroup> row_groups
  5: optional list<KeyValue> key_value_metadata
  6: optional string created_by
  7: optional list<ColumnOrder> column_orders;
  8: optional EncryptionAlgorithm encryption_algorithm
  9: optional binary footer_signing_key_metadata
}

Key information about the file is stored in the FileMetaData structure. The most important fields are the format version, the schema, the number of rows (num_rows), and the list of row groups - each of them discussed below.

Format Versions

At the time of writing this article, the Parquet format was already available in version 2.12.0.

The most crucial changes between versions 1.0 and 2.0 include the new Data Page V2 layout and additional encodings such as the delta encodings and RLE_DICTIONARY.

Even though version 2.0 introduces many improvements, the biggest players still use version 1 by default.

Row Count

This information might seem unintuitive at first in the context of columnar format.
However, we must remember that columnar format is just a way of storing values, not data structure.

Despite data being grouped by columns and their type, reading/writing still happens in the classic way - row by row.

The difference is that we don't read one row at a time, but an entire group of rows, loading column by column into memory, then reconstructing rows based on appropriate indexes.

Remembering that to properly write data in columnar format we must operate on logical groups of rows rather than individual rows, we can relatively easily manage the balance between memory usage and the number of IO operations.

Reading and writing from memory is faster than reading and writing from disk (though not always). By increasing the number of rows that will be written in one group, we reduce the number of groups, thus the number of IO operations.
This increases write/read speed while increasing memory usage.

This also works the other way - reducing the number of rows in a group increases the number of groups in the file, thus increasing the number of IO operations.

Group size, not row count - Parquet allows defining not the number of rows, but the maximum size of a row group.
However, remember these aren't absolute values (more on this later), so some groups may be smaller/larger than the allowed size, depending mainly on the Parquet library implementation.

In the Parquet format documentation, we'll find that the suggested group size is 512MB - 1GB. However, it's worth approaching this with some common sense, especially if we're not relying on HDFS (Hadoop Distributed File System) for reading/writing.
The suggested value is set so that one row group fits in one HDFS block, guaranteeing that reading happens from exactly one node.

It's worth remembering this, but if we don't plan to use Parquet with a distributed file system, smaller row groups will allow us to save a lot of memory.

A very good example of when smaller groups are more efficient is when we want to read only a small slice of rows from somewhere in the middle of the file (pagination).

Assuming we need to read only 100 rows from a file containing 10 million rows, setting a smaller group size will allow us to save a lot on memory. Why?

If we divide 10 million into, say, 10 groups, each group contains 1 million rows. This means in practice we need to read the entire group, then extract only the 100 rows we're interested in.

With a smaller group size that allows dividing 10 million into 1000 groups, analyzing the file metadata, we'll be able to skip more groups and load much fewer rows into memory.
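To make that concrete, here's a small back-of-the-envelope sketch in plain PHP (the row counts, offsets, and group sizes are hypothetical; real readers skip groups based on the row counts stored in the metadata):

<?php

// Hypothetical pagination: read 100 rows starting at row 5,000,000
// out of a 10,000,000-row file, for two different row group sizes.

function rowsLoaded(int $rowsPerGroup, int $offset, int $limit): int
{
    // Groups that overlap the requested window must be read in full.
    $firstGroup = intdiv($offset, $rowsPerGroup);
    $lastGroup = intdiv($offset + $limit - 1, $rowsPerGroup);

    return ($lastGroup - $firstGroup + 1) * $rowsPerGroup;
}

// 10 groups of 1,000,000 rows: an entire group must be loaded.
echo rowsLoaded(1_000_000, 5_000_000, 100), PHP_EOL; // 1000000

// 1000 groups of 10,000 rows: only one small group is touched.
echo rowsLoaded(10_000, 5_000_000, 100), PHP_EOL;    // 10000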

The decision about row group size should be thoughtful, considering both write and read performance for the specific file. Proper configuration directly translates to resource usage, which ultimately translates to money.

Schema

We're slowly getting to the heart of Parquet - Row Groups. But before we analyze their structure, we need to go back to another very important aspect of Parquet: the data schema.

Let's start with data types. Parquet distinguishes between physical and logical types.

Physical Types

Physical types are the basic data types used to store values in a Parquet file. These are: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY.

Logical Types

Logical types are used to represent more complex data structures. You can think of them as extensions of physical types (e.g., STRING, UUID, or TIMESTAMP built on top of BYTE_ARRAY or INT64).

The current list can always be checked at the source: apache/parquet-format

Besides the division into logical and physical types, Parquet also distinguishes flat and nested columns.
Flat columns are those that store a single value, e.g., Int32, Boolean, Float, etc.
Nested columns are those that store more than one value, e.g., List, Map, etc.

There are actually 3 types of nested columns: Struct, List, and Map.

Struct is a special type of column that allows nesting any other types, making it possible to create practically any data structure.

Using the above types, we can model practically any data structure and then efficiently store and query it.

Let's look at the Thrift definitions of SchemaElement and a few related elements.

struct SchemaElement {
  1: optional Type type;
  2: optional i32 type_length;
  3: optional FieldRepetitionType repetition_type;
  4: required string name;
  5: optional i32 num_children;
  6: optional ConvertedType converted_type;
  7: optional i32 scale
  8: optional i32 precision
  9: optional i32 field_id;
  10: optional LogicalType logicalType
}

enum FieldRepetitionType {
  REQUIRED = 0;
  OPTIONAL = 1;
  REPEATED = 2;
}

enum Type {
  BOOLEAN = 0;
  INT32 = 1;
  INT64 = 2;
  INT96 = 3;
  FLOAT = 4;
  DOUBLE = 5;
  BYTE_ARRAY = 6;
  FIXED_LEN_BYTE_ARRAY = 7;
}

union LogicalType {
  1:  StringType STRING
  2:  MapType MAP
  3:  ListType LIST
  4:  EnumType ENUM
  5:  DecimalType DECIMAL
  6:  DateType DATE
  7:  TimeType TIME
  8:  TimestampType TIMESTAMP
  10: IntType INTEGER
  11: NullType UNKNOWN
  12: JsonType JSON
  13: BsonType BSON
  14: UUIDType UUID
}

Most values should be fairly obvious, but let's look at FieldRepetitionType.

This value tells us whether a given column is required, optional, or repeated.
If a column is required, it means its value cannot be null.
If a column is optional, the value can be null, and if it's repeated, it means it can contain multiple values (e.g., a list).

Here's what an order file schema might look like (in DDL form):

message orders_scheme {
    required fixed_len_byte_array(16) order_id (UUID)
    required int64 created_at (TIMESTAMP(MICROS,false))
    optional int64 updated_at (TIMESTAMP(MICROS,false))
    optional float discount
    required binary email (STRING)
    required binary customer (STRING)
    required group address {
          required binary street (STRING);
          required binary city (STRING);
          required binary zip (STRING);
          required binary country (STRING);
    }
    required group notes (LIST) {
          repeated group list {
                required binary element (STRING);
          }
    }
    required group items (LIST) {
          repeated group list {
                required group element {
                      required binary sku (STRING);
                      required int64 quantity (INTEGER(64,true));
                      required float price;
                }
          }
    }
}

Nested Types

To fully understand the structure of row groups, we first need to understand how Parquet flattens nested types.
While simple structures like address from the above example can basically be reduced to 4 simple columns (address.street, address.city, address.zip, and address.country), with Map or List the situation is a bit more complicated.

For example, if we wanted to flatten Map<string,int32>, we'd get two flat columns: one for the keys (column.key_value.key, a required STRING) and one for the values (column.key_value.value, an INT32), where key_value is a repeated group.

Going back to our orders schema, the flat path to sku looks like this: items.list.element.sku, while the complete flattened structure looks like this:

┌─────────────────────────────┬──────────────────────────┬──────────────┬────────────┬────────────────┬────────────────┐
│ path                        │ type                     │ logical type │ repetition │ max repetition │ max definition │
├─────────────────────────────┼──────────────────────────┼──────────────┼────────────┼────────────────┼────────────────┤
│ order_id                    │ FIXED_LEN_BYTE_ARRAY(16) │ UUID         │ REQUIRED   │ 0              │ 0              │
│ created_at                  │ INT64                    │ TIMESTAMP    │ REQUIRED   │ 0              │ 0              │
│ updated_at                  │ INT64                    │ TIMESTAMP    │ OPTIONAL   │ 0              │ 1              │
│ discount                    │ FLOAT                    │ -            │ OPTIONAL   │ 0              │ 1              │
│ email                       │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 0              │ 0              │
│ customer                    │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 0              │ 0              │
│ address.street              │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 0              │ 0              │
│ address.city                │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 0              │ 0              │
│ address.zip                 │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 0              │ 0              │
│ address.country             │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 0              │ 0              │
│ notes.list.element          │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 1              │ 1              │
│ items.list.element.sku      │ BYTE_ARRAY               │ STRING       │ REQUIRED   │ 1              │ 1              │
│ items.list.element.quantity │ INT64                    │ -            │ REQUIRED   │ 1              │ 1              │
│ items.list.element.price    │ FLOAT                    │ -            │ REQUIRED   │ 1              │ 1              │
└─────────────────────────────┴──────────────────────────┴──────────────┴────────────┴────────────────┴────────────────┘

Row Groups

+-----------------------------------+
| PAR1                              |
+-----------------------------------+
| Row Group 1                       |
|   +-----------------------------+ |
|   | Column Chunk 1              | |
|   |   (Metadata + Data Pages)   | |
|   +-----------------------------+ |
|   | Column Chunk 2              | |
|   |   (Metadata + Data Pages)   | |
|   +-----------------------------+ |
|   ...                             |
+-----------------------------------+
| Row Group 2                       |
|   +-----------------------------+ |
|   | Column Chunk 1              | |
|   |   (Metadata + Data Pages)   | |
|   +-----------------------------+ |
|   | Column Chunk 2              | |
|   |   (Metadata + Data Pages)   | |
|   +-----------------------------+ |
|   ...                             |
+-----------------------------------+
| ...                               |
+-----------------------------------+
| Metadata                          |
+-----------------------------------+
| PAR1                              |
+-----------------------------------+

As we already know, a Parquet file is divided into row groups. Writing to a file looks roughly like this:

  1. Write the PAR1 magic bytes at the beginning of the file.
  2. Buffer incoming rows in memory until the configured row group size is reached.
  3. Flatten the buffered rows into columns, encode each column into pages, compress them, and write them out as column chunks forming a single row group.
  4. Repeat for the following row groups, collecting their metadata along the way.
  5. At the very end, write the file metadata, its 4-byte length, and the closing PAR1 magic bytes.

Of course, this description is very simplified. In reality, it's a bit more complex, and different implementations may differ in details.

Let's focus on the row group structure. First, let's look at the Thrift RowGroup definitions.

struct RowGroup {
  1: required list<ColumnChunk> columns
  2: required i64 total_byte_size
  3: required i64 num_rows
  4: optional list<SortingColumn> sorting_columns
  5: optional i64 file_offset
  6: optional i64 total_compressed_size
  7: optional i16 ordinal
}

Already at this stage, we can see how much information about a specific row group is stored in metadata.
For now, though, let's focus on three of them: columns (the list of column chunks that make up the group), num_rows (the number of rows in the group), and total_byte_size (the total uncompressed size of the group's data).

Important: each row group always contains all columns defined in the schema.
Even if throughout the entire group a column contains only null values.

Column Chunks

Let's go deeper and look at the Thrift ColumnChunk definition:

struct ColumnChunk {
  1: optional string file_path
  2: required i64 file_offset
  3: optional ColumnMetaData meta_data
  4: optional i64 offset_index_offset
  5: optional i32 offset_index_length
  6: optional i64 column_index_offset
  7: optional i32 column_index_length
  8: optional ColumnCryptoMetaData crypto_metadata
  9: optional binary encrypted_column_metadata
}

struct ColumnMetaData {
  1: required Type type
  2: required list<Encoding> encodings
  3: required list<string> path_in_schema
  4: required CompressionCodec codec
  5: required i64 num_values
  6: required i64 total_uncompressed_size
  7: required i64 total_compressed_size
  8: optional list<KeyValue> key_value_metadata
  9: required i64 data_page_offset
  10: optional i64 index_page_offset
  11: optional i64 dictionary_page_offset
  12: optional Statistics statistics;
  13: optional list<PageEncodingStats> encoding_stats;
  14: optional i64 bloom_filter_offset;
  15: optional i32 bloom_filter_length;
}

Remember: Everything we've looked at so far is still part of metadata.
This means we get all this information about columns, row groups, and the data itself by reading only the end of the file, regardless of whether the file is 1MB or 1TB.

Here we basically reach the place that allows us to read data from the file.
But before that happens, we need to learn about the last data structure necessary for reading.

Data Pages

Pages - another logical division in the Parquet file structure.
Row Group -> Column Chunk -> Data Pages

Reading Parquet really comes down to analyzing the metadata structure, locating the address of the beginning of a specific row group, then a specific column in the group, and then iterating through and reading data from each page.

But before we start reading pages, we need to understand whether we're dealing with a DataPage, IndexPage, or DictionaryPage.

To do this, we first read the PageHeader - the page header, whose Thrift definition looks like this:

struct PageHeader {
  1: required PageType type
  2: required i32 uncompressed_page_size
  3: required i32 compressed_page_size
  4: optional i32 crc
  5: optional DataPageHeader data_page_header;
  6: optional IndexPageHeader index_page_header;
  7: optional DictionaryPageHeader dictionary_page_header;
  8: optional DataPageHeaderV2 data_page_header_v2;
}

enum PageType {
  DATA_PAGE = 0;
  INDEX_PAGE = 1;
  DICTIONARY_PAGE = 2;
  DATA_PAGE_V2 = 3;
}

To read the header, we need to know its address relative to the beginning of the file. Here's how we can calculate it for a selected row group and selected column:

  1. Read FileMetadata
  2. Find the appropriate RowGroup and locate the relevant ColumnChunk
  3. Having the ColumnChunk, we get file_offset - the address of the beginning of the ColumnChunk relative to the beginning of the file.

Important: At this stage we don't need to physically load bytes into memory yet.
It's enough to create a stream allowing us to read data directly from the file.

The first thing to read is the header, PageHeader. By doing this via Thrift, passing the stream and setting the appropriate beginning address, we get a PageHeader data structure that will tell us exactly how to read the page itself.
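As a rough sketch, following the same pattern as the FileMetaData example earlier (the PageHeader class name assumes it lives next to FileMetaData in the generated Flow\Parquet\Thrift namespace, and $fh / $fileMetadata are the file handle and metadata from the earlier sketches):

<?php

use Flow\Parquet\Thrift\PageHeader;
use Thrift\Protocol\TCompactProtocol;
use Thrift\Transport\TMemoryBuffer;

// Offsets come from the metadata we already read.
$columnChunk = $fileMetadata->row_groups[0]->columns[0];

// The dictionary page (if present) precedes the first data page.
$pageStart = $columnChunk->meta_data->dictionary_page_offset
    ?? $columnChunk->meta_data->data_page_offset;

// Read a generous slice of bytes from that offset; the header is small,
// but its exact size is only known after deserialization.
fseek($fh, $pageStart);
$buffer = fread($fh, 8192);

$pageHeader = new PageHeader();
$pageHeader->read(new TCompactProtocol(new TMemoryBuffer($buffer)));

// Now we know what kind of page follows and how many bytes it occupies:
// $pageHeader->type, $pageHeader->compressed_page_size, ...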

There are 3 types of pages: DataPage (together with its newer DataPageV2 variant), DictionaryPage, and IndexPage.

DataPage

A page containing binary representation of data from a selected column from rows that ended up in the selected row group.
This is the simplest and most direct type of page. It contains "just" data.

When reading an integer column, what we're interested in is the number of rows in the specific group (each row is one value in a DataPage). So knowing that this group has, say, 100 values, we know we need to read 400 bytes (an int32 takes 4 bytes).

Alright, but what if the column is optional? That means it can contain null values.
Here the situation gets a bit more complicated because we need to know which rows contain null values.
Where does this knowledge come from, you ask?
Definition Levels

The situation gets a bit complicated. At the beginning I wrote that DataPage contains only data, and now I'm adding some Definition Levels.

In reality, the data page structure looks roughly like this:

Parquet Data Page: int32
=======================================
[ Repetition Levels ]: 0, 0, 0, 0, 0
---------------------------------------
[ Definition Levels ]: 1, 0, 1, 1, 0
---------------------------------------
[ Values           ]: 42, 73, 19
=======================================

For now, let's focus only on Definition Levels and Values. It's easy to see the relationship between them. The number of Definition Levels (and Repetition Levels) on each page is always equal to the number of rows encoded on that page,
regardless of whether there are nulls or not. Definition Levels tell us whether a given row contains a value or a null.

Based on this, we can easily determine the total number of non-null Values, which allows us to read them.
In the above example we have 5 rows, of which 3 hold values. Since an int32 takes 4 bytes, we know we need to read a total of 12 bytes.
We also know that when converting the column back to rows, the first row will contain the value 42, the second row null, the third row 73, the fourth row 19, and the fifth row null.
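A tiny sketch of that reconstruction step in plain PHP, using the numbers from the diagram above (flat, optional column only):

<?php

// Definition level 1 means "this row has a value", 0 means "this row is null".
$definitionLevels = [1, 0, 1, 1, 0];
$values = [42, 73, 19];

$rows = [];
$valueIndex = 0;

foreach ($definitionLevels as $level) {
    $rows[] = $level === 1 ? $values[$valueIndex++] : null;
}

var_dump($rows); // [42, null, 73, 19, null]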

Important: Repetition Levels and Definition Levels are much more complicated, though. More on this later.

This is roughly what the DataPage structure looks like.

DictionaryPage

Since the data lives in DataPages, what's the purpose of a DictionaryPage?
Well, a DictionaryPage is a page containing a dictionary of values -
a dictionary used when decoding the data, especially useful for columns with many repeated values.

It works roughly like this: when reading a ColumnChunk, we start with the first page. If this page is a DictionaryPage, we know we're dealing with a dictionary (we actually know this from the beginning, as it's recorded in the column metadata).

If, for example, we're reading a column with high repeatability, like a column with country names, instead of writing the full country name for each row in DataPage, we only write its position in the dictionary.
For such a column, the first page in the column would be DictionaryPage, and subsequent ones would be DataPage.

The difference is that in DataPage, instead of full values, there would be positions in the dictionary, which we'd keep in memory to reconstruct rows.

Important: Each ColumnChunk can contain only one DictionaryPage.

This can provide huge savings. Instead of, say, writing the word Poland 10,000 times in binary, which is 60,000 bytes, we only write its position in the dictionary (at most 4 bytes per entry), which additionally gets packed using the Run Length Encoding / Bit-Packing Hybrid algorithm. This algorithm, also exploiting the repetition of consecutive values, further reduces the total number of bytes needed.
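A simplified sketch of the idea in plain PHP (real readers get the dictionary from the DictionaryPage and the indices from RLE/bit-packed data; the values here are made up):

<?php

// Dictionary encoding in miniature: the DictionaryPage stores each distinct
// value once, the DataPages store only small integer indices into it.

$dictionary = ['Poland', 'Germany', 'France'];  // from the DictionaryPage
$indices = [0, 0, 0, 1, 0, 2, 0, 0, 1, 0];      // from a DataPage

// Reconstructing the column is a simple lookup.
$column = array_map(fn (int $i) => $dictionary[$i], $indices);

var_dump($column); // Poland, Poland, Poland, Germany, Poland, France, ...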

IndexPage

The last type of page is IndexPage.
This page doesn't contain data, so it's not necessary for reading or writing.
Each ColumnChunk can contain only one IndexPage and it's always located at the end, after DictionaryPage and all DataPages.

The purpose of this page is to store statistics regarding ColumnChunk, like Min/Max values, number of nulls, or sorting method for each page in a specific ColumnChunk. This allows quick filtering and finding only specific pages within a given ColumnChunk, significantly speeding up file searching when we're interested in specific information.

Note: Each ColumnChunk in its metadata contains similar statistics as IndexPage, but not for each page but for the entire ColumnChunk.
Thanks to this, we can first skip entire columns that don't interest us and then even specific pages, reducing to an absolute minimum the amount of data we need to read.

Considering that this information is in the file metadata, even the largest Parquet files can be read and filtered lightning-fast even if they're only available over the network.
It's enough that we can read the metadata, then based on it locate a specific row group, then the selected column, and finally specific pages.
This gives us a very precise location of our data, which we can fetch using an HTTP Range header.

This is exactly one of the reasons why Parquet is so powerful. We're no longer talking about brutally downloading and iterating through a gigabyte file. Parquet allows us with surgical precision to download and read only those areas of the file that really interest us.
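For illustration, here's a minimal sketch of fetching just a byte range over HTTP with plain PHP streams (the URL, offset, and length are hypothetical, and the server must support range requests):

<?php

// Fetch only a specific byte range of a remote Parquet file, e.g. the bytes
// of one column chunk located from the metadata.

$url = 'https://example.com/orders.parquet'; // hypothetical URL
$offset = 1_048_576;  // start of the column chunk (from ColumnChunk metadata)
$length = 262_144;    // its total_compressed_size

$context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => sprintf("Range: bytes=%d-%d\r\n", $offset, $offset + $length - 1),
    ],
]);

// If the server honors the Range header, it responds with 206 Partial Content
// and only these bytes travel over the network.
$chunkBytes = file_get_contents($url, false, $context);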

Dremel

When discussing the DataPage structure, I mentioned Definition Levels and Repetition Levels.

The example I covered was very simple because it concerned a simple column (int32), so Repetition Levels don't apply at all.
The situation changes dramatically when we're dealing with a nested column, like a structure, list, or map. Let's look at an example.

[{"sku":"abc", "quantity": 1, "price": 100}, {"sku":"def", "quantity": 2, "price": 200}]

Going back to the earlier part of this article, specifically to nested types, we know that after flattening, our data becomes 3 flat columns: items.list.element.sku, items.list.element.quantity, and items.list.element.price.

Each of them will live in a separate Column Chunk, and each will contain one or more pages.

So how, based on these two values (Repetition / Definition Levels), do libraries reading files know how deep in the structure the values are and which element they belong to?
What if our structure looked like this:

[{"sku":"abc", "quantity": 1, "price": 100}, {"sku":null, "quantity": 10, "price": 100}, {"sku":"def", "quantity": 2, "price": 200}] (in the second element, sku has a null value).

What if the structure is much more nested - how do we know which value goes to which nesting level?

The answer to this and many other questions can be found in a document published by Google: Dremel: Interactive Analysis of Web-Scale Datasets which describes how Google stores and searches nested data structures.

The tool used by Google is called Dremel and is a distributed system for searching large datasets.
It's based on 2 algorithms, Shredding and Assembling, which are described very briefly in the above document.

Note: Describing the exact operation of these algorithms is beyond the scope of this already long post.
If there's interest in the topic, though, I'll try to cover this thread in upcoming posts.

These algorithms are based on 3 definitions: the Repetition Level, the Definition Level, and the Value itself.

As we already mentioned, the Definition Level determines whether a given row contains a value or not. The Repetition Level, which for flat columns is always 0, tells us at which level of nesting a repetition occurs - in other words, whether a value starts a new row or continues a list within the current one.

Note: Knowledge of how exactly the algorithms from Dremel work isn't necessary for optimal use of Parquet.
For this reason, I won't elaborate on this topic, but if there's interest, I'll try to cover this thread in upcoming posts.

Below I'll just roughly show what flattened data would look like.

        Input:
[
    'items' => [
        ['sku' => 'abc', ...],
        ['sku' => 'def', ...],
    ]
]

Output:
{
  ["sku"] =>
  {
    ["repetition_levels"] => { [0] => int(0) [1] => int(1) }
    ["definition_levels"] => { [0] => int(1) [1] => int(1) }
    ["values"] => { [0] => string(3) "abc" [1] => string(3) "def" }
  }
}
    

So in reality we save 0, 1 (repetition levels), 1, 1 (definition levels), "abc", "def" - and not just "abc", "def".
It's exactly these additional numbers that tell us how to reconstruct any nested structure.
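A rough sketch of how those levels drive reassembly for a single-level list column like items.list.element.sku (plain PHP with made-up values; the real Dremel assembly algorithm handles arbitrary nesting):

<?php

// Repetition level 0 starts a new row; 1 appends to the current row's list.
// Definition level 1 means a real value; lower levels would mean null/missing.

$repetitionLevels = [0, 1, 0];
$definitionLevels = [1, 1, 1];
$values = ['abc', 'def', 'ghi'];

$rows = [];
$valueIndex = 0;

foreach ($repetitionLevels as $i => $rep) {
    if ($rep === 0) {
        $rows[] = []; // start a new row
    }
    $current = array_key_last($rows);
    $rows[$current][] = $definitionLevels[$i] === 1 ? $values[$valueIndex++] : null;
}

var_dump($rows); // [["abc", "def"], ["ghi"]]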

Interestingly, even repetition levels and definition levels are appropriately packed for optimization using the Run Length Encoding / Bit-Packing Hybrid algorithm.

That's not the end, because it's not only the levels that get packed, but also the values themselves.
Depending on the column type, values can be packed in different ways. A list of all encoding algorithms supported by Parquet (at least in theory) can be found in the official documentation.

And information about which algorithm was used to pack data before writing can be found in metadata, under this path: RowGroups[x].ColumnChunk[y].PageHeader[z].data_page_header.encoding

But this isn't Parquet's last word when it comes to optimization!

Compression

After packing and writing our data in binary form for a specific page, each page is additionally compressed.

Depending on the implementation, Parquet allows using different compression algorithms: UNCOMPRESSED (no compression), SNAPPY, GZIP, BROTLI, LZ4, ZSTD, and (historically) LZO.

A very popular option is Snappy, which offers a very good compromise between speed and compression level.

Tools like Apache Spark use it by default.

Encryption

One of the last interesting features I'd like to discuss is encryption!

Yes, Parquet allows encrypting data - encrypting at multiple levels.

Note: Encryption is one of those features I haven't covered yet in the PHP implementation.
For this reason, I won't elaborate on this topic. Once I get a chance to implement this functionality, I'll try to supplement this article.

Encryption in Parquet is based on Parquet Modular Encryption and uses AES for data encryption.

Encryption, especially of selected columns, takes Parquet to a higher level of data storage.
This allows us to relatively easily, with minimal overhead, additionally secure data we store in Parquet files.

Let's imagine Parquet is used to store customer data, where the email and phone columns contain sensitive data.
In this situation, it just begs for those two columns to be additionally secured. Even if someone manages to get physical access to the file, without the key they still won't be able to read the data.

Summary

This is exactly the secret behind Parquet's efficiency. Instead of storing arbitrary data in text form, Parquet goes several steps further.
First, it enforces a data schema based on simple yet incredibly flexible types, each of which can be represented in binary form.
Then the binary form is appropriately packed to avoid unnecessary byte repetitions, and finally additionally compressed using very efficient algorithms.
The cherry on top is the advanced and detailed metadata, available at multiple levels, which allows filtering out unnecessary partitions or even entire files without reading their contents.

Moreover, thanks to appropriate logical division, over which we have full control (size of groups and pages), we can decide what's more important for us - speed or memory savings. Searching or reading data, or maybe security, for which we'll use additional encryption?

Parquet is truly a powerful tool that, in the right hands, allows efficient storage and searching of huge amounts of data.

If this post inspired you to experiment with this amazing data format, let me know in the comments!

Help

If you need help building a central data warehouse, I'd be happy to help you.
Contact me, and together we'll create a solution perfectly tailored to your needs.

I also encourage you to visit the Flow PHP Discord server, where we can talk directly.
