
Data Analytics in Distributed Transactional Systems
In this post, I'll tackle the problem of data analytics in distributed transactional systems.
If you're looking for ideas on building a central data warehouse that will let you gather data from across your entire system,
regardless of its fragmentation, without drowning in operational costs, then this post is for you.
It all starts innocently enough
Most systems we build daily store data in some relational database. PostgreSQL has become an extremely popular and solid choice, practically becoming the industry standard in recent years.
The story of most projects tends to look pretty similar: we start by validating an idea, gain our first users, the system starts generating revenue, business figures out how to increase profits, new features emerge. Each new feature means a few new tables in the database.
To speed up development, we use an ORM and automatically generate migrations that create and update the database schema.
Initially everything goes smoothly, new features bring expected profits, the business starts scaling. We hire more developers to build more features in parallel.
Occasionally someone reports that the system is "slowing down" in certain places. A quick reconnaissance, an even quicker diagnosis: a missing index in some table.
We add an index in the ORM mapping configuration for a field the system frequently searches on. Problem solved.
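With Doctrine, for example, such an index can be declared right in the mapping. A minimal sketch, with a made-up entity and field name:
<?php

declare(strict_types=1);

namespace App\Entity;

use Doctrine\ORM\Mapping as ORM;

// Hypothetical entity; the field the system frequently searches on is `email`.
#[ORM\Entity]
#[ORM\Table(name: 'customers')]
#[ORM\Index(name: 'idx_customers_email', columns: ['email'])]
class Customer
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    private int $id;

    #[ORM\Column(type: 'string', length: 255)]
    private string $email;
}
The next auto-generated migration picks this up and emits the corresponding CREATE INDEX statement.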
The growing team of developers places great emphasis on quality, perhaps even using
advanced software development techniques like Event Storming or Domain-Driven Design.
CI/CD runs countless tests, making sure changes don't introduce regressions.
The idyll continues, the team or perhaps multiple teams start adding new modules to the system. Modules properly isolated, responsible for specific tasks, never crossing their boundaries or stepping on other modules' toes.
For communication, we obviously use queues, implementing the Outbox/Inbox Pattern.
To ensure proper isolation, we establish rules stating that each module only has access to database tables that belong to it. To get data from another module, you must go to that module, whether through some internal API or any other way.
Every so often, business comes to us asking: "can you quickly generate this report for us?" Of course, a few lines of SQL, maybe a few dozen, and the report is ready.
Business is happy, the report in CSV format goes to Excel (the most popular BI tool), business draws conclusions, plans new features and changes.

Time passes, new tables sprout like mushrooms after rain
We can persist in this state for a very long time, even a good few years.
Meanwhile, someone somewhere will surely come up with the idea to add reporting capabilities to the system. It's only a matter of time.
System status reports are one of the most crucial tools for business, providing insight into user behaviors, preferences, and trends. They allow not only understanding what's happening but also properly planning what's about to happen.
The better and more detailed the reports, the better decisions can be made based on them. Good business decisions translate to higher profits, higher profits translate to bigger budgets. Bigger budgets translate to better tools, larger teams, better salaries, or bonuses.
It should be in every developer's interest to provide business with the best and most precise data possible, after all, better results directly translate to better profits.
First symptoms
The system is running, generating profits. It consists of about 5, maybe even 10 modules, and each module comprises 20-50 tables in the database. Each module provides its own reports.
- Sales
- Marketing
- Logistics
- Inventory Levels
- Users
Each module exposes only part of the data, a piece of the bigger picture, but none provides a view of the whole.
Teams have indeed implemented reference keys to data from other modules, and they even managed to create a single place in the user interface from which reports can be generated.
But that's still not enough...
Very quickly it turns out that reports generated in different modules, written by different developers, perhaps even in different technologies, have different data formats and different naming standards.
Date ranges are interpreted differently: one module includes the start and end dates, another excludes them, and yet another uses a right-open interval to simplify pagination, because it happens to share the same piece of code with an API.
Since each module is independent, with its own boundaries and its own nomenclature, at some point we realize that what we call one thing in one module, another module exposes under a completely different name. Because in that module's context, that name makes sense.
Over time, we'll probably also realize that each team has defined different data retention and storage policies. Despite having data from the last 5 years in a key module, we can't do anything with it, because the modules that provide the data needed to enrich the basic report only keep data from the last 2 years.
These aren't problems that a bit of Excel magic couldn't solve (except perhaps for the missing data). We'll rename these columns, delete those, add a quick filter and that's enough.
We'll create one big file with a sheet named "Dashboard", and all the others will be read-only, feeding the dashboard.
This approach might even work for a while. Maybe even longer than a while, but let's not kid ourselves. It will all eventually fail, and according to Murphy's law, it will fail at the worst possible moment.
What's wrong with Excel?
Nothing! Excel is a fantastic tool. The problem isn't with Excel, but with how it's used.
All this magic of cleaning and preparing data shouldn't happen in Excel, not on a larger scale. If we're talking about a one-time quick report, no problem. We do what we must, hack together formulas, analyze the data, and forget about it.
But if this is to be part of our daily routine, if we have to go through the same process over and over, keeping up with constant changes and system evolution, sooner or later those spreadsheets will become outdated.
Columns no longer exist or have changed names, new columns have appeared, the data format has changed, or worse, one of the teams managing a module deleted some data without realizing it was being used by some business user somewhere in one of their reports that they open once a quarter.
In the long run, more complex spreadsheets that pull data from automatically generated system reports, which are then merged based on implicit rules, are unmaintainable.
So maybe we'll hook up some BI tool?
Thought many developers who have repeatedly encountered the problem of generating reports.
Take Metabase for example. A free tool that we can set up in minutes using Docker.
Give it access to our database and several or all tables, and through a very user-friendly interface, business will be able to easily and pleasantly generate the most complex reports.
Reports that can contain data from multiple modules simultaneously!
We can even hire a data analyst with basic SQL knowledge who, with a properly prepared query, will deliver everything that can't be clicked together in the UI.
Except this doesn't solve the problem
It just postpones it.
If we look closely at what has changed, only one thing has changed. The tool... We've moved the problem of cleaning and joining data from Excel to Metabase.
Excel has indeed returned to its original role: we can now throw reports downloaded from Metabase into it.
But our implicit logic of joining/cleaning data has moved from spreadsheets to SQL queries.
Besides that, all problems remain the same:
- data inconsistency
- naming inconsistency
- lack of unified backward compatibility policy
- lack of unified data retention policy
So maybe we'll establish processes and rules?
Most of the above problems can be solved by implementing appropriate processes and rules.
We can establish naming standards stating that every table in the database must contain a module prefix in its name, and columns are named in lowercase and separated by underscores.
We can establish that each module stores data from the last 5 years (hot storage), everything older is archived (cold storage).
We can establish that date ranges are always treated as right-open intervals.
We can establish that we don't delete any columns from the database, or that before deleting anything, we first enter a transition period during which we show every system user which columns will change and how.
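To make the date-range rule concrete, here is a minimal sketch of such a convention as a shared value object (the class name is ours, invented for illustration): every range is interpreted as [start, end), with the end exclusive.
<?php

declare(strict_types=1);

namespace App\Shared;

// Hypothetical shared convention: every reporting range is right-open, [start, end).
final class ReportingPeriod
{
    public function __construct(
        public readonly \DateTimeImmutable $start,
        public readonly \DateTimeImmutable $end, // exclusive
    ) {
    }

    // A whole calendar day: [00:00:00 of $date, 00:00:00 of the next day)
    public static function day(\DateTimeImmutable $date): self
    {
        $start = $date->setTime(0, 0);

        return new self($start, $start->modify('+1 day'));
    }

    public function contains(\DateTimeImmutable $moment): bool
    {
        return $moment >= $this->start && $moment < $this->end;
    }
}
When every module accepts and produces such a period, "last month" means the same thing everywhere.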
Even if we assume for the sake of discussion that these processes can be globally implemented across several teams, and that these teams will strictly and very carefully follow them, it's not enough...
Database scaling isn't cheap
Especially when we rely on cloud solutions.
Let's imagine a situation where, during peak system operation hours (when users generate the most transactions), a business analyst working on their own schedule needs to generate a report backed by a typical, thousands-of-lines-long SQL query.
The analyst runs the query, the database starts grinding. The query takes 5, 10, 15 minutes. The database starts sweating.
Users bombard the system with new orders (or whatever other operations that generate lots of writes) while the analyst waits for results.
At the same moment, someone from business needs to quickly check several reports, each containing "total row count in the table". There are several such people.
All these operations overlap, our already heavily loaded database can't keep up.
Some user transactions don't go through.
The system is barely breathing. Wait time for the most basic operations is measured in seconds.
And now the cherry on top: while all these Dantean scenes are taking place, while PagerDuty is red-hot from incidents of all types and kinds, while teams are panicking trying to bring the system back to life and DevOps are figuring out how to quickly scale the database...

The CEO starts a presentation for a potential business partner, with whom cooperation is to prove crucial in the company's development strategy...
So maybe we'll just set up a replica?
After all, reports won't overload our transactional database.
We'll indeed double the database maintenance costs, but we'll reduce the risk of system overload and we'll be able to connect our favorite business intelligence tool directly to the replica, giving us real-time data.
Sounds fantastic, but in practice it's not that simple.
Even ignoring potential problems arising from the very nature of replication, the main and fundamental problem I most often encounter is perception.
A developer who generated those tables through ORM mappings will look at database tables completely differently than a data analyst.
The developer will know which tables need to be joined together to get the complete picture. They'll understand the constraints and conditions embedded somewhere in the application code. Above all, the developer knows or at least should be familiar with the system's (its data's) lifecycle.
All this knowledge is usually not available to analysts.
It's like telling someone to look at something through a keyhole. You can certainly see something. Some conclusions can be drawn, but it will be very difficult to reconstruct the whole.
It's enough that we have a JSONB column in the database where we store some data structures. Let's say the system allows 3 valid combinations of the same structure, but one is super rare, so rare that it hasn't occurred in the system yet. Looking at the data, even comprehensively, the analyst simply can't know that there are 3 combinations of one structure. During normalization, they'll account for 2 cases, while the third becomes a ticking time bomb that will explode, as always, at the least expected moment.
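To illustrate (with invented data), imagine a hypothetical payment_details JSONB column for which the application accepts three shapes, but only two have ever been written:
<?php

// Three shapes the application considers valid for a hypothetical payment_details JSONB column.
// An analyst inspecting production data would only ever see the first two.
$cardPayment = ['type' => 'card', 'last4' => '4242', 'brand' => 'visa'];
$transferPayment = ['type' => 'transfer', 'iban' => 'DE89370400440532013000', 'reference' => 'INV-2024-001'];
$voucherPayment = ['type' => 'voucher', 'code' => 'WELCOME10']; // valid, but has never occurred yet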
In other words, if we have several independent modules in the system, each with its own database or at least its own tables, which in total gives us 200-300 tables, then expecting that an analyst will handle this without problems, won't make mistakes, and that reports won't deviate from expectations is naive to say the least.
Nevertheless, exposing a copy/replica of the database for analysts and giving it a 4-letter name derived from the word "analytics" is still widely used.
BI tools compete to create the best user interface through which reports can be clicked together. They promise that we'll be able to analyze data without SQL.
Yes, this can work, in many places it does work. What we don't talk about loudly though is:
- Problems with backward compatibility and data structure changes
- Problems with properly maintaining/versioning/testing gigantic SQL queries/scripts normalizing data on the fly
- Replicas/Copies generate additional costs
- Reducing replica resources is either impossible or prevents generating reports in acceptable times
Which ultimately affects data quality and the effectiveness of business decision-making.
What are we left with?
Maybe let's first establish what problems we want to solve:

- Analyzing data/generating reports must have no impact on system operation.
- Data in reports must always be fresh (some delay is acceptable; how much is determined individually)
- Reports must reflect the real, undistorted state of the system
- Data structure must be resistant to regression
- Consistent data retention and archiving policy
1) Resource Separation
This is nothing groundbreaking: if we don't want our system to be exposed to overloads caused by report generation hammering the database, we need to set up a separate database.
What database to choose for analytics?
This is basically a topic for a separate article or even a series of articles. There are many solutions, some better, some worse. There is no one magical solution to all problems.
My advice, especially for smaller teams inexperienced in data management, is not to jump on technologies you have no experience with.
The key is the appropriate data format. After converting many narrow tables into one wide one, you'll probably find that generating the same report without using 20x JOIN no longer takes 10 minutes, but less than half a second.
And what if the problem is aggregations, not joins?
Then, instead of aggregating on the fly, it's better to prepare a table containing this data in aggregated form, not raw.
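A minimal sketch of that idea, assuming a Doctrine DBAL connection to the analytical database (the table and column names are invented for this example): the aggregate is materialized once per synchronization run instead of being computed on every report request.
<?php

declare(strict_types=1);

use Doctrine\DBAL\Connection;

// Hypothetical helper: refreshes a pre-aggregated daily sales table in the analytical database.
function refreshDailySales(Connection $analytical, \DateTimeImmutable $day): void
{
    // Make the refresh repeatable: drop the day's aggregate before rebuilding it.
    $analytical->executeStatement(
        'DELETE FROM daily_sales WHERE sales_date = :day',
        ['day' => $day->format('Y-m-d')],
    );

    $analytical->executeStatement(
        <<<'SQL'
        INSERT INTO daily_sales (sales_date, seller_id, orders_count, revenue)
        SELECT DATE(created_at), seller_id, COUNT(DISTINCT order_id), SUM(price * quantity)
        FROM order_line_items
        WHERE created_at >= :start AND created_at < :end
        GROUP BY DATE(created_at), seller_id
        SQL,
        [
            'start' => $day->format('Y-m-d 00:00:00'),
            'end' => $day->modify('+1 day')->format('Y-m-d 00:00:00'),
        ],
    );
}
Reports then read from daily_sales directly, which is cheap and predictable.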
2) Fresh Data
Okay, but since we're creating a new, independent database, how do we ensure that data in this database is fresh and up to date?
A lot depends on the acceptable delay in data synchronization. Most often it's enough if the analytical database is about 24 hours behind the transactional database. That is, it contains data up to "yesterday", including all of "yesterday".
Why? Because few business decisions are made in the moment. If some decisions must be made in such a short time, then proper automations are built.
If a 24-hour delay is acceptable (sometimes it's not and there are ways to handle that too), it's enough to run synchronization a few times a day. Of course, there's no golden rule here either. Just as there's no rule saying how large a range to synchronize at once.
There is, however, one good practice that facilitates synchronization. It involves making sure that the main tables in the transactional system contain record creation and modification dates.
Having these two pieces of information, we're able to narrow the synchronization window to a specific time period.
How does this look in practice? We can, for example, run the synchronization process every 6 hours, collecting only records changed in the last 24 hours.
Of course, these are example numbers, these values need to be determined based on the size and behavior of the data.
Why 24 hours? It's an additional safeguard. We could fetch only the last 7 hours of data, but if for any reason a synchronization run doesn't execute and we don't catch it, we might lose data.
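The window calculation itself is trivial; the important part is that consecutive windows overlap, which is only safe if the load step is idempotent (for example an upsert keyed on the record id, as in the command shown later):
<?php

declare(strict_types=1);

// Runs, say, every 6 hours but always pulls the last 24 hours of changes.
// Overlapping windows are harmless as long as the write side upserts by primary key.
$end = new \DateTimeImmutable('now');
$start = $end->modify('-24 hours');

// ... then: SELECT * FROM orders WHERE updated_at BETWEEN :start AND :end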
3) Reflecting System State
My opinion on this topic might seem controversial, but I believe that the best knowledge about data and system or module behavior belongs to the team building that system/module.
It's precisely this team that should be responsible for ensuring that the data generated by the system, or the part of it the team owns, reaches the central data repository.
In other words, it's the team implementing a given functionality that should, based on previously gathered requirements, transform this data into the appropriate format and push it forward.
This is probably the easiest way to ensure that data is complete and developers from a given team are aware that this data is being used somewhere. The analytical data format becomes a kind of contract for them – a contract they must follow.
This isn't much different from an API schema contract.
4) Resistance to Regression
This point is probably the most complex. Proper implementation of data schema evolution is often not so much difficult as troublesome.
In short, the rules look like this:
- Never delete columns
- All columns we add must be nullable or have a default value
- Column types can only be extended; for example, int can be changed to bigint, but not vice versa
- Don't change column names
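As a sketch of what following these rules looks like in practice, here is a hypothetical Doctrine migration (table, column, and version names are invented; the SQL assumes PostgreSQL):
<?php

declare(strict_types=1);

namespace App\Migrations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

// Example of a backward-compatible change: widen a type and add a nullable column,
// instead of renaming or dropping anything.
final class Version20240101000000 extends AbstractMigration
{
    public function up(Schema $schema): void
    {
        // int -> bigint: extending, never narrowing
        $this->addSql('ALTER TABLE order_line_items ALTER COLUMN quantity TYPE BIGINT');

        // new columns are nullable (or get a default), so existing readers keep working
        $this->addSql('ALTER TABLE order_line_items ADD COLUMN discount_code VARCHAR(64) DEFAULT NULL');
    }

    public function down(Schema $schema): void
    {
        $this->addSql('ALTER TABLE order_line_items ALTER COLUMN quantity TYPE INT');
        $this->addSql('ALTER TABLE order_line_items DROP COLUMN discount_code');
    }
}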
So can't we delete anything?
We can, but not just anyhow. Generally, how and how often we break backward compatibility depends only on us.
If we only use our analytical data source internally and, let's say, the analyst building reports is up to date with system changes, with proper coordination we could add new tables and then delete old ones, giving them time to update reports.
However, if our analytical data source is used for Data Science, or if we work in a multi-tenancy environment where analytical data/reports are made available to clients, then we must approach the matter completely differently.
5) Data Storage and Archiving Policy
As I mentioned above, it's very important that data in the analytical database, especially provided by different modules, follows the same rules regarding storage time.
If we keep inventory levels in the system only from the last year, but orders from the last 5 years, analysts won't be able to build a report that contains data from both sources.
This is more a formal than technical problem. It would seem that you just need to agree, but in practice it's not that simple.
To establish a common data storage and archiving policy, you need to consider not only technical aspects, but also legal, business, and analytical ones, which may require compromises.
Examples
Let's now look at a simple ETL process example, whose task is to transfer data from the transactional database to the analytical database.
In this example, I'll use Flow PHP, but this isn't something particularly unique to PHP. In any language, we can build something very similar using any library that facilitates creating CLI applications and some data processing tool.
The example below (in a slightly modified form) comes from a live stream session I had the pleasure of recording with Roland, who runs the Never Code Alone channel. You can find the video material on YouTube by searching for "Flow PHP"
Let's assume that the order format looks something like this:
schema
|-- order_id: uuid
|-- seller_id: uuid
|-- created_at: datetime
|-- updated_at: ?datetime
|-- cancelled_at: ?datetime
|-- discount: ?float
|-- email: string
|-- customer: string
|-- address: map<string, string>
|-- notes: list<string>
|-- items: list<structure{item_id: string, sku: string, quantity: integer, price: float}>
Our goal is to transfer these orders to the analytical database, so let's prepare the schema for both input and target data.
<?php

declare(strict_types=1);

namespace App\DataFrames;

use Flow\ETL\Adapter\Doctrine\DbalMetadata;
use Flow\ETL\Schema;
use function Flow\ETL\DSL\datetime_schema;
use function Flow\ETL\DSL\float_schema;
use function Flow\ETL\DSL\integer_schema;
use function Flow\ETL\DSL\list_schema;
use function Flow\ETL\DSL\map_schema;
use function Flow\ETL\DSL\schema;
use function Flow\ETL\DSL\string_schema;
use function Flow\ETL\DSL\uuid_schema;
use function Flow\Types\DSL\type_float;
use function Flow\Types\DSL\type_integer;
use function Flow\Types\DSL\type_list;
use function Flow\Types\DSL\type_map;
use function Flow\Types\DSL\type_string;
use function Flow\Types\DSL\type_structure;

final class Orders
{
    // Schema of the orders data as it exists in the transactional database.
    public static function sourceSchema() : Schema
    {
        return schema(
            uuid_schema("order_id"),
            uuid_schema("seller_id"),
            datetime_schema("created_at"),
            datetime_schema("updated_at", nullable: true),
            datetime_schema("cancelled_at", nullable: true),
            float_schema("discount", nullable: true),
            string_schema("email"),
            string_schema("customer"),
            map_schema("address", type: type_map(key_type: type_string(), value_type: type_string())),
            list_schema("notes", type: type_list(element: type_string())),
            list_schema("items", type: type_list(element: type_structure(elements: ["item_id" => type_string(), "sku" => type_string(), "quantity" => type_integer(), "price" => type_float()]))),
        );
    }

    // Schema of the order_line_items table in the analytical database:
    // the address map is flattened into columns and items are unpacked into separate rows.
    public static function destinationSchema() : Schema
    {
        return self::sourceSchema()
            ->replace('updated_at', datetime_schema("updated_at"))
            ->remove('address')
            ->add(
                string_schema('street', metadata: DbalMetadata::length(2048)),
                string_schema('city', metadata: DbalMetadata::length(512)),
                string_schema('zip', metadata: DbalMetadata::length(32)),
                string_schema('country', metadata: DbalMetadata::length(128)),
            )
            ->remove('items')
            ->add(
                uuid_schema('item_id', metadata: DbalMetadata::primaryKey()),
                string_schema('sku', metadata: DbalMetadata::length(64)),
                integer_schema('quantity'),
                integer_schema('price'),
                string_schema('currency', metadata: DbalMetadata::length(3)),
            );
    }
}
Notice that the target table structure is no longer order-oriented, but order item-oriented. Our goal is to unpack order items so that each is a separate row.
Thanks to this, the analyst who will need to generate a report won't have to figure out how to unpack JSON on the fly.
The Address column has also been split into several columns, making the report easier to filter.
Another important transformation is converting price from float to int by multiplying the floating-point value by 100.
The last change will be adding information about what currency the prices are in. But where does this information come from? This is a very important detail resulting from poor implementation. In this particular case, all orders are in dollars. The system knows this, developers know this, but someone looking at the tables in the database without context might not have this knowledge.
Our target structure should look something like this:
schema
|-- order_id: uuid
|-- seller_id: uuid
|-- created_at: datetime
|-- updated_at: datetime
|-- cancelled_at: ?datetime
|-- discount: ?float
|-- email: string
|-- customer: string
|-- notes: list<string>
|-- street: string
|-- city: string
|-- zip: string
|-- country: string
|-- item_id: uuid
|-- sku: string
|-- quantity: integer
|-- price: integer
|-- currency: string
The next step will be to create the appropriate table in the analytical database. We can achieve this relatively easily thanks to the adapter for Doctrine DBAL.
<?php

declare(strict_types=1);

namespace App\Dbal;

use App\DataFrames\Orders;
use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\Provider\SchemaProvider as MigrationsSchemaProvider;
use function Flow\ETL\Adapter\Doctrine\to_dbal_schema_table;

final class SchemaProvider implements MigrationsSchemaProvider
{
    // Source table in the transactional database (name assumed for this example).
    public const ORDERS = 'orders';

    // Target table in the analytical database.
    public const ANALYTICAL_ORDER_LINE_ITEMS = 'order_line_items';

    public function createSchema(): Schema
    {
        // Doctrine Migrations diffs this schema against the analytical database.
        return new Schema(
            tables: [
                to_dbal_schema_table(Orders::destinationSchema(), self::ANALYTICAL_ORDER_LINE_ITEMS),
            ]
        );
    }
}
So in the analytical database, we'll store a "simplified" or "normalized" version of the orders table. Normalization involves unpacking order items and making them separate rows, as well as splitting the "Address" column into several columns.
Let's look at the CLI command that will be responsible for transferring data from the transactional database to the analytical database.
<?php

declare(strict_types=1);

namespace App\Command;

use App\DataFrames\Orders;
use App\Dbal\SchemaProvider;
use Doctrine\DBAL\Connection;
use Flow\Doctrine\Bulk\Dialect\SqliteInsertOptions;
use Flow\ETL\Rows;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Helper\TableSeparator;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Style\SymfonyStyle;
use function Flow\ETL\Adapter\Doctrine\from_dbal_key_set_qb;
use function Flow\ETL\Adapter\Doctrine\pagination_key_desc;
use function Flow\ETL\Adapter\Doctrine\pagination_key_set;
use function Flow\ETL\Adapter\Doctrine\to_dbal_table_insert;
use function Flow\ETL\DSL\analyze;
use function Flow\ETL\DSL\constraint_unique;
use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\DSL\lit;
use function Flow\ETL\DSL\ref;
use function Flow\ETL\DSL\rename_replace;
use function Flow\Types\DSL\type_datetime;

#[AsCommand(
    name: 'app:orders:import',
    description: 'Import orders from the transactional database to the analytical database.',
)]
class OrdersImportCommand extends Command
{
    public function __construct(
        private readonly Connection $transactional,
        private readonly Connection $analytical,
    ) {
        parent::__construct();
    }

    protected function configure(): void
    {
        $this->addOption('start-date', null, InputOption::VALUE_REQUIRED, 'Start date for the data pull.', '-24 hours')
            ->addOption('end-date', null, InputOption::VALUE_REQUIRED, 'End date for the data pull.', 'now');
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $io = new SymfonyStyle($input, $output);
        $io->title('Importing orders');

        $startDate = type_datetime()->cast($input->getOption('start-date'));
        $endDate = type_datetime()->cast($input->getOption('end-date'));

        $io->progressStart();

        $report = data_frame()
            // Extract: read only orders changed within the synchronization window,
            // paginated with a key set (updated_at, order_id) instead of OFFSET.
            ->read(
                from_dbal_key_set_qb(
                    $this->transactional,
                    $this->transactional->createQueryBuilder()
                        ->select('*')
                        ->from(SchemaProvider::ORDERS)
                        ->where('updated_at BETWEEN :start_date AND :end_date')
                        ->setParameter('start_date', $startDate->format('Y-m-d H:i:s'))
                        ->setParameter('end_date', $endDate->format('Y-m-d H:i:s')),
                    pagination_key_set(
                        pagination_key_desc('updated_at'),
                        pagination_key_desc('order_id')
                    )
                )->withSchema(Orders::sourceSchema())
            )
            // Transform: flatten the address map into separate columns.
            ->withEntry('_address', ref('address')->unpack())
            ->renameEach(rename_replace('_address.', ''))
            // Transform: expand order items so that each item becomes its own row.
            ->withEntry('_item', ref('items')->expand())
            ->withEntry('_item', ref('_item')->unpack())
            ->renameEach(rename_replace('_item.', ''))
            ->drop('_item', 'items', 'address')
            // Transform: make the implicit explicit (currency) and store prices as integers.
            ->withEntry('currency', lit('USD'))
            ->withEntry('price', ref('price')->multiply(100))
            ->constrain(constraint_unique('item_id'))
            ->match(Orders::destinationSchema())
            // Load: upsert into the analytical table, keyed on item_id.
            ->write(
                to_dbal_table_insert(
                    $this->analytical,
                    SchemaProvider::ANALYTICAL_ORDER_LINE_ITEMS,
                    SqliteInsertOptions::fromArray([
                        'conflict_columns' => ['item_id'],
                    ])
                )
            )
            ->run(function (Rows $rows) use ($io) {
                $io->progressAdvance($rows->count());
            }, analyze: analyze());

        $io->progressFinish();
        $io->newLine();

        $io->definitionList(
            'Orders Import Summary',
            new TableSeparator(),
            ['Execution time ' => \number_format($report->statistics()->executionTime->highResolutionTime->seconds) . ' seconds'],
            ['Memory usage ' => \number_format($report->statistics()->memory->max()->inMb()) . ' MB'],
            ['Rows inserted ' => \number_format($report->statistics()->totalRows())],
        );

        return Command::SUCCESS;
    }
}
Of course, this isn't the most beautiful or even the most correct form. Normally, a CLI command wouldn't contain the ETL pipeline definition, but for the example's purposes, it's a good start.
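One possible next step (a sketch of our own structure, not something prescribed by the library) is to move the read definition into a dedicated factory, so the command is left with option parsing, the transformations, the write target, and the run:
<?php

declare(strict_types=1);

namespace App\DataFrames;

use App\Dbal\SchemaProvider;
use Doctrine\DBAL\Connection;
use Flow\ETL\DataFrame;
use function Flow\ETL\Adapter\Doctrine\from_dbal_key_set_qb;
use function Flow\ETL\Adapter\Doctrine\pagination_key_desc;
use function Flow\ETL\Adapter\Doctrine\pagination_key_set;
use function Flow\ETL\DSL\data_frame;

// Sketch: the extract step lives in its own factory; class and method names are ours.
final class OrdersExtract
{
    public function __construct(private readonly Connection $transactional)
    {
    }

    public function between(\DateTimeImmutable $start, \DateTimeImmutable $end): DataFrame
    {
        return data_frame()
            ->read(
                from_dbal_key_set_qb(
                    $this->transactional,
                    $this->transactional->createQueryBuilder()
                        ->select('*')
                        ->from(SchemaProvider::ORDERS)
                        ->where('updated_at BETWEEN :start_date AND :end_date')
                        ->setParameter('start_date', $start->format('Y-m-d H:i:s'))
                        ->setParameter('end_date', $end->format('Y-m-d H:i:s')),
                    pagination_key_set(
                        pagination_key_desc('updated_at'),
                        pagination_key_desc('order_id')
                    )
                )->withSchema(Orders::sourceSchema())
            );
    }
}
The command then calls something like $ordersExtract->between($startDate, $endDate), chains the same transformations (or those move into the factory as well), and finishes with ->write(...) and ->run(...).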
A dedicated central data warehouse is undoubtedly a tempting option, especially in places where lack of visibility prevents efficient decision-making.
Fortunately, this is the kind of functionality that can be added at basically any stage of a project's life.
It may require introducing additional processes and some discipline from teams, but the benefits of such a solution are enormous.
- No fear that analytics will affect system operation
- We have access to all corners of our system, every microservice or module
- Such a central database is the best gift for analysts
- Data Science is no longer about burning time cleaning data
- We can easily and safely connect basically any Business Intelligence tools
- We create a data-driven culture within our organization
Of course, like everything new, such changes might look difficult to introduce. Lack of experience working with data, at least in the initial stages, makes the task seem almost impossible.
However, my experience shows that the hardest part is starting. Once we have:
- A few first pipelines processing data
- Several or a dozen schemas for our data
- Some more complex transformations
- Prepared tests
- Established processes and procedures
The rest of the work becomes almost mechanical.
It's worth remembering, though, that there's no one universal solution that will fit every system. In each case, the approach must be adapted to the specifics of the given system and organization.
How to get started?
If you need help building a central data warehouse, I'd be happy to help.
Contact me, and together we'll create a solution perfectly tailored to your needs.
I also encourage you to visit the Flow PHP Discord server, where we can talk directly.
