Hi fellow data engineers,

I’m currently restructuring a pipeline written in PySpark on Databricks. It involves a lot of transformations, which results in an extensive DAG, but I’m fine spending some extra processing resources to build a standard dimensional model (on top of the necessary transformations).

I was wondering what real benefits you have seen from a star schema design over the “one big table” (OBT) approach that I could preach to my team? (My main goal is to end up with a smaller Power BI model.)

And as a side question: what tools do you use to create a dimensional model, such as a star schema, in code?

Thanks a lot!

  • Andy
    1 year ago

    You can build your dimension tables in isolation from one another, each representing a different business object or domain. All the necessary transformations live in there, including the business logic and rules.
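    For example, here’s a minimal PySpark sketch of a dimension built this way (the `raw_customers` source, the column names, and the segment rule are all hypothetical):

    ```python
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical raw source table; swap in your own.
    raw = spark.table("raw_customers")

    dim_customer = (
        raw
        # Business rules live here, and only here.
        .withColumn(
            "segment",
            F.when(F.col("annual_revenue") > 1_000_000, "enterprise").otherwise("smb"),
        )
        .dropDuplicates(["customer_id"])
        # Surrogate key that fact tables will join on.
        .withColumn("customer_sk", F.xxhash64("customer_id"))
        .select("customer_sk", "customer_id", "name", "segment")
    )

    dim_customer.write.mode("overwrite").saveAsTable("dim_customer")
    ```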

    The same dimension will be used across many fact tables, so the business logic only needs to be written once. If it changes, you only need to update it in one place.
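    A rough illustration of that reuse (again with made-up table and column names): once `dim_customer` exists, every fact build just joins it, and none of them re-implements the segment rule.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The one dimension, built once, with its business rules already applied.
    dim_customer = spark.table("dim_customer").select("customer_sk", "customer_id")

    # Both fact tables reuse it; no business logic is repeated.
    fact_orders = (
        spark.table("raw_orders")
        .join(dim_customer, "customer_id")
        .select("customer_sk", "order_id", "order_ts", "amount")
    )

    fact_tickets = (
        spark.table("raw_tickets")
        .join(dim_customer, "customer_id")
        .select("customer_sk", "ticket_id", "opened_ts", "priority")
    )

    fact_orders.write.mode("overwrite").saveAsTable("fact_orders")
    fact_tickets.write.mode("overwrite").saveAsTable("fact_tickets")
    ```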

    If the same logic is used in multiple OBTs, you would need to write (and maintain) it in each one. Alternatively, you could put it in an interim table, but in that case I would suggest sticking to a standard and making that interim table your facts and dimensions.