undefined

points

by andrenotgiant3 days ago |

comments

by khurs10 hours ago|

[-]

Both Iceberg and Delta Lake support 'time travel' so you can query data as it was at a certain date.

SELECT count * FROM my_table AS OF "2025-01-01"

https://delta.io/blog/2023-02-01-delta-lake-time-travel/

https://iceberg.apache.org/docs/latest/spark-queries/#spark-...

by _zoltan_10 hours ago|

parent|

[-]

up to a limited number of snapshots. it's not arbitrary times, it's when there was an atomic snapshot created in the catalog.

by hasyimibhar10 hours ago|

prev|

[-]

It wouldn't be possible to do this with LTAP architecture since (I'm assuming) the individual logical changes are not visible. But honestly I've always seen SCD type 2 table as a workaround due to lack of data modeling experience in the source database. If you design your tables correctly, you shouldn't need SCD type 2 downstream.

For example, if you know your user can change emails, and there might be events from another source that is keyed by user email (e.g. marketing-related events), then naturally you will need some sort of email_history table that has historical mapping of user id to email (you probably need it for audit purposes too). Then in this case there is no need to build SCD type 2 table of user from CDC, it's already there.

by TheTaytay3 hours ago|

parent|

[-]

I think you have a point, and SCD type 2 feels like a workaround, but there is also something to be said for the ability to query every row as it was at any given version. I’m not saying that SCD type 2 is the best solution given there might be a more domain-specific way to do it, but I see it a lot like file-based version control. It’s convenient to be able to examine all files as they existed at any point in time, without having to “model” the ways in which those files might change directly into the domain of the individual files.

If you have something like dolt (not affiliated), a version controlled database, you wouldn’t have to slap change dates on anything OR create your historical table. The changes would be implicit in the version history.

by hasyimibhar2 hours ago|

parent|

[-]

> If you have something like dolt (not affiliated), a version controlled database, you wouldn’t have to slap change dates on anything OR create your historical table. The changes would be implicit in the version history.

Nope, even if I have the ability to see the exact changes of each row, I would still add timestamps everywhere, because timestamp of row change does not equal event timestamp. For example, if I have an order table with status column, and I see a CDC event where status changed from in_progress to completed, I cannot simply assume that the CDC timestamp is the timestamp when order was completed. It's possible that the source database received the event late a few minutes late due to delay upstream, or it's backfilling some missed orders a few days ago. Having a completed_at timestamp (and a bunch of other timestamps for each order lifecycle) would eliminate any ambiguities, and your data analyst will thank you for it.

It's the same thing with row history. You cannot simply assume that your row changes are aligned with the logical history of your entity.

by 9 hours ago|

parent|

prev|

[-]

deleted

by eveningtree10 hours ago|

prev|

[-]

Rather than answering directly, I'm thinking about this problem from the other end altogether ever since I saw the dbricks rt demo. Apologies for the rambling response, as I haven't yet finished thinking about this problem...

We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.

(1) Limited by computation - there's only so much data that we can store on disks and nvme

(2) Limited by wallet - disks and nvme are EXPENSIVE

Also, the tight coupling of compute and data didn't help. It limited the size of databases on the individual expensive compute nodes.

So, another question will be -

What's currently stopping me from keeping the scd history tables right in my oltp db? what's forcing me to copy state into my etl/elt pipeline and the process it into scd into a dedicated olap db?

To some extent,the answer is still the same - the oltp cannot scale for the storage size required for keeping historical data. So, I've had to take out the 'cold' historical data and keep it in my olap freezer.

Now, if oltp itself is scaling, I'm not gonna bother with the copying step. I'll just prefer to store the history in oltp itself.

In my perspective (majorly from handling IoT systems), I need olap for 2 reasons - (1) storage scalability, and (2) analytical processing speed

I now consider (1) to be a solved problem

As for (2), I'm still not sure how this architecture ends up matching the query processing speeds of column-oriented storages. But again, I need to study more.

The SCD pipeline still remains in some form. Either in the form of (1) scd rows that we currently keep (etl pipeline) , or (2) as older lsn rows that simply don't get deleted (existing db engine).

I've done quite a lot of experimentation with (2), and it is a pretty solid concept to work with.

I've spent quite a lot of years hammering my brain at databases and datastores in general. And I've now got a feeling that this is it. Finally.

by 10 hours ago|

prev|

[-]

deleted

by ignoreusernames8 hours ago|

prev|

[-]

If safe keeper exposes the changes to the tables somehow, a type2 scd is just a windowed lag over the primary key sorted by the timestamp

by nikita4 hours ago|

parent|

[-]

Safekeepers keep a window of WAL in Postgres WAL format and doesn’t have an external API.

It streams WAL to pageservers and Postgres read replicas