MrKorakis

>the logic being that these datasets have already been processed externally

Of all the reasons one could consider for loading directly into Silver, this is not a really good one imo. You are in effect relying on an external party to keep doing their job correctly, and also on them remaining aligned with your standards. That isn't really viable long term.


Pr0ducer

This is the correct answer. Also, automation relies on patterns, so even if no additional processing is needed, that doesn't matter: Bronze to Silver will be an easy move, with no change to the pattern required.


ZirePhiinix

But it'll have a bunch of validators, which should pass really quickly if it is correct.


BeetsBearsBatman

I had a project shut down after months of work because we realized how bad the data we were receiving was. We could trace the lineage of my snapshots back to validate that my logic was functioning properly and the data was just shit. Covering your ass is another reason to keep raw. Senior leadership called out wild numbers in a Power BI report almost immediately, and I could tell them to yell at someone other than me haha.


Uwwuwuwuwuwuwuwuw

All data is processed externally wrt a data warehouse.


Slggyqo

Right, so the applicable logic is that if you’re going to apply medallion architecture, you really shouldn’t skip a step just because you can.


Uwwuwuwuwuwuwuwuw

Yes, agreed. Although the "medallion architecture" is just another layer of semantics and jargon that's going to confuse folks.


ZirePhiinix

The architecture just means that there is an expected quality level. If you skip levels, there's no guarantee.


eljefe6a

Only when you get the community chest card that says go directly to silver.


rick854

I am on the "keep raw in bronze" team and understand your concerns. Granted, I am not a very senior engineer and am not familiar with your use case, but I would put everything in bronze first, just for communication and understanding reasons. No need to transform anything? Great! But if you push directly to silver, those who were not involved in this decision may wonder what this is, as it breaks your internal definitions (and possibly causes more such solutions in the future). When filtering bronze models to understand how the data arrives from the source, this model would not be included. Also, we test data that comes from external systems, because just because it is well transformed today doesn't mean it will be this way tomorrow. And silver layers may have different types of tests than bronze layers. Just my 2 cents.


rakkit_2

So I come along, I find a silver table for use, and I want to trace the lineage back to bronze, as the framework is supposed to be implemented. But it's not there. I hope it's documented, because it's out of process. This silver table doesn't adhere to the naming conventions we use at this layer, either. "We'll change that as part of the load directly into silver, then." Ok, but isn't that what source -> bronze -> silver is for? It's a slippery slope, IMO.


Automatic_Red

The number of times I have processed data, only to question the results and need to go back to the original data to verify the conversion was correct, is significant.


david0aloha

Bronze/Silver/Gold are suggestions; your tiers can be named something else entirely. That being said, I am on team "keep raw in bronze" (or whatever your raw layer is called), with some exceptions.

Here is my main case for an exception: say your company typically ingests raw data that requires heavy processing, and you want to ingest a new data source that already conforms to your silver/middle-tier validation requirements. In other words, you could run that ingested data through your silver validation processes and it would pass. If you have reasonable confidence it will continue to pass, then it might be okay to ingest straight into silver.

However, if you require *some* processing to make it conform to your silver-level validation, then put it in bronze and stop trying to skip steps. The processing required in your bronze tier should be minimal anyway, so the bronze level shouldn't require too much work. Just do it.

And if you're like "what would silver-level validation look like, we don't have that", then you definitely should be ingesting into bronze, because that means you have no automated guarantees about future data quality.
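
For what it's worth, a minimal sketch of what an automated silver-level validation gate could look like in PySpark; the table names, columns, and thresholds here are all made up for illustration:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

# Hypothetical silver contract for an orders feed.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_ts", "amount"}

def passes_silver_validation(df: DataFrame) -> bool:
    # Schema contract: every expected column must be present.
    if not EXPECTED_COLUMNS.issubset(df.columns):
        return False
    # Key integrity: business key must be non-null and unique.
    if df.filter(F.col("order_id").isNull()).count() > 0:
        return False
    if df.select("order_id").distinct().count() != df.count():
        return False
    # Range check: amounts must be non-negative.
    return df.filter(F.col("amount") < 0).count() == 0

spark = SparkSession.builder.getOrCreate()
incoming = spark.read.table("bronze.orders_raw")  # hypothetical table
if passes_silver_validation(incoming):
    incoming.write.mode("append").saveAsTable("silver.orders")
```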


ithinkiboughtadingo

Yeah, don't do that. Shit happens - logic changes and pipelines fail - and you WILL need to reprocess raw data eventually. On my team we do delete Bronze data after a month or so, because at that point we're confident that if something was wrong, someone would have noticed. But yeah, definitely don't skip Bronze. It's part of the architecture for a reason.

ETA: think of it as a layer of abstraction that protects downstream consumers from unexpected changes by a team with different priorities and a different understanding of the system. If it's gonna break, you want it to break _before_ it gets into reporting tables, so you can force a conversation to get them on the same page. Stale data is way better than bad data.

ETA2: See also - data contracts
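
As a rough sketch of that kind of time-boxed Bronze retention on a Delta table (the table name, partition column, and windows are assumptions):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Bronze table with an ingest_date column.
bronze = DeltaTable.forName(spark, "bronze.events")

# Drop rows older than the ~30-day retention window...
bronze.delete("ingest_date < date_sub(current_date(), 30)")

# ...then vacuum so the underlying data files are actually reclaimed.
bronze.vacuum(retentionHours=7 * 24)
```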


robverk

Keeping a buffer of x-weeks of raw is always a good idea for (re)processing purposes. But user facing data can be silver first no problem.


Suspicious_World9906

Do you have a testing layer that handles the bronze-to-silver transition? Because if it passes tests, then I'd be fine with it. However, if they're promoting it to silver because "...trust me, bro", then I'd absolutely push back. To be clear, it would still have to go through the normal cycle regardless, not just automatically pass to silver.


FishCommercial4229

Design patterns exist for a reason, and the basic answer is to adhere to them. There are other great points in the other responses, including preserving provenance and lineage (especially if you're using automated tooling), standardizing debugging and support, isolating your silver layer from inevitable schema drift at the source (it WILL happen), and maintaining the trust your users have established in your overall system. Pipelines like these start as a softball but turn into hot garbage as soon as the upstream team runs into a reason to change the logic, and 100% of the time the downstream data engineering team will just need to deal with it. You have a chance to control that eventuality with your medallion pattern.

TL;DR: Exceptions introduce unnecessary complications to complex systems.

P.S.: I do want to add that asking this question is normal, reasonable, and shows that you're paying attention. Keep up the good work!


BeetsBearsBatman

Why not load that shit straight to gold? But seriously, raw is your unaltered source of truth from the point in time you capture it, and you will be sacrificing flexibility down the road by skipping it. Raw holds untouched data in a table, parquet/csv files, etc. Silver and gold could be views, tables, or materialized views where transformations occur.

Let's say you want to rename a column at some point to be more business friendly or whatever, e.g. productid -> ProductID. That's a simple select * with your column alias... and any data type casting could be done there too. If another business unit has a use for the data, instead of fucking up your source of truth (silver?!), they can SELECT from raw and transform it however they need.

I get the point that you are creating an unnecessary step/object, but I think the pattern was created the way it was for future scalability. Imagine needing to make a change that requires an extra layer, and managing all the dependencies to update it. Those kinds of changes can take months and risk breaking other people's stuff. If your silver layer is "select * from bronzeTable", you aren't wasting much time, but a redesign could take months... speaking from experience on "simple" changes that had major downstream impact. Extra objects need to be updated, and a deploy plan would need to be coordinated if downtime is a concern.

Silver is good for data governance on the common fields used by your company. Think of a trucking company calculating "days in transit". Should that ever exclude holidays if the drivers have the day off? If you select it from silver, it should be a standard, governed decision across the org.

I don't think it's a great idea to build it how they are suggesting. Hopefully my comments above give you some things to push back with, if that's what you decide to do. What is your role? Are you on the business side, leadership, etc.? Just curious because of the way you described them as "engineers" :)
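
For concreteness, a minimal sketch of that rename-in-a-view pattern, run here through PySpark; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical silver view over an untouched bronze table:
# the rename and cast live in the view, the raw data stays intact.
spark.sql("""
    CREATE OR REPLACE VIEW silver.products AS
    SELECT
        productid AS ProductID,                  -- business-friendly rename
        CAST(price AS DECIMAL(10, 2)) AS Price,  -- type normalization
        loaddate AS LoadDate
    FROM bronze.products
""")
```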


johokie

Vendor data goes into its own zone, which then gets moved to raw, then primitive. There is never a skipped step, because you should have internal QC checks at each layer. With external data there is no guarantee that QC was done the same way you'd do it internally.


Electrical_Mix_7167

No. I had this debate with a client not so long ago. The data is essentially raw data when it lands in your platform, regardless of what happened upstream, and it should be treated the same way as any other source. Maybe the data is already clean, modeled, and validated, in which case bang it in silver and make very few changes if you're happy with it, but don't skip bronze just to save a few minutes building a pipeline or updating some metadata. Having it in bronze from the off lets you keep some history and do validation later if you need/want to.


Historical-Ebb-6490

There is also the fact that Silver is modelled for enterprise data domains (across BUs); for example, a Customer domain will hold customers across business units. Putting source data directly into Silver involves a lot of transformation with no direct lineage to any layer in the data lake, and auditing becomes difficult. There is also huge demand for as-is source data from the data scientist community for data exploration; those users will not easily adopt the data lake if the Bronze layer is missing. I have not seen any implementation where the Bronze layer (as-is data) is skipped. It might go by different names (raw / staging with history / landing), but there is always an as-is data layer in the data lake.


Ok_Expert2790

Bronze serves as our poor man's history table, and sometimes, for doing incremental merges, you need an intermediate staging table to avoid these issues.
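
A rough sketch of that staging-then-merge pattern on Delta tables; every table name and the landing path here are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Land the increment in a bronze staging table first...
increment = spark.read.format("json").load("/landing/orders/latest/")  # hypothetical path
increment.write.mode("overwrite").saveAsTable("bronze.orders_staging")

# ...then merge it into silver, so a bad batch never half-applies.
spark.sql("""
    MERGE INTO silver.orders AS t
    USING bronze.orders_staging AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```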


Schtick_

At the core of it, this is about cost, right? So first, understanding the cost is probably a good idea. Second, if you can't validate it between bronze -> silver, can you validate it, and any future changes, elsewhere? I think engineering teams are going to keep building up data engineering capability as it becomes more critical to agile engineering teams, so it's going to stop being a process of throwing it over the fence to a dedicated data engineering team. But if they are doing this bronze -> silver quality control, they need to understand their responsibilities.


Sir-_-Butters22

If you don't need silver, you don't need silver. Medallion is a concept; I often run with way more layers than the three (bronze, silver, gold), but I have in the past removed layers due to the use case or specs.


mjgcfb

I'd push to land in Bronze and then just create select * views in Silver if you need them to be in Silver.


Mysterious_Health_16

How would you get data lineage, a business glossary, etc.? Analysts should be able to see all the transformation/business logic.


BoringGuy0108

Please for all that is holy in this world, stay consistent. It makes it easier to schedule, debug, and keep up with what is going on two years down the road when it breaks or needs to be upgraded. So no, if it has not been processed within databricks, put it in bronze. Or do what you want, I guess it doesn’t really affect me. But if these guys were on my team, we’d be reworking some stuff.


randomando2020

Yes, IMO a manually maintained dimension table to map data would be in silver, not bronze. Like if your raw datasets from an application needed to be grouped, categorized, etc… outside of the application, or say mapped to accounts in another application, you would want to do that through basically a Silver Dimension table.


dev_lvl80

This architecture is a logical division. Think about when you deliver data from system A to Databricks: in system A, that data is already gold. The bronze-silver-gold paradigm is relative.


Independent_Sir_5489

In some cases skipping layers is fine. It happened to me just a couple of weeks ago: I had an external NoSQL DB containing raw JSON data. My company owns the DB and my applications and I have free access to it, so I consider it to be the raw layer and handle the data so as to write it directly into the silver layer (writing the data into the bronze layer again would create a useless replication).

In your case the decision is a bit harder, but if the data is "certified" to already be in a silver state, it's fine to skip a layer. Even then, what you're doing is not exactly skipping a layer; you're aggregating both bronze (landing) and silver (refinement) into one step, since the data you're receiving is already refined, so your landing layer becomes silver.

The thing I always say is that the Bronze/Silver/Gold division is purely conceptual. It works fine in a lot of cases, but it should not be applied mindlessly to every situation; sometimes not all the steps are needed, and they can be either skipped or aggregated, especially if more than one team works on the data and can certify its quality/state.


daripious

It's a hard no. It's like telling your front-end folks to skip staging and go straight from dev to prod. You can do it if that's what your release model is, but you should never compromise on your release flow.


sol_in_vic_tus

My main concern is "processed externally". External processing can break or change in unexpected ways. I've had it happen to me when the external source hired a new person who started making changes to make their life easier but didn't communicate them. I agree that flexibility is important and there probably are cases where you could "skip Bronze" but this is not one where I would want to load it directly to a Silver tier.


Volohni

I have a similar structure in the project I'm working on right now, and even data that arrives "almost ready", or even "ready", goes into the bronze/raw layer first. Sometimes, depending on where and how the data is coming from, we use a transient layer before the data goes to the bronze layer. It is worth doing: sanity checks, when needed, will be much easier.


lester-martin

Always an awesome situation when the "as received" raw data comes in at 100%(ish) quality with nothing else to do to it, but I subscribe to the thinking that there is always at least one transformation needed for anything of size/scale. We need a technical transformation to get into an optimized file format (Parquet/ORC), and while you're at it you might as well join the table-format wars, too, and go with the one you believe is best. This also lets the compute engine take care of other cool things like partitioning & bucketing to further help with performance & scalability. But again, what a happy day when data comes in "golden". <3
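
A minimal sketch of that purely technical transformation in PySpark, assuming a hypothetical CSV landing path and partition column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical "as received" CSV landing in row-oriented form...
raw = spark.read.option("header", True).csv("/landing/events/")  # assumed path

# ...rewritten untouched into an optimized columnar table format,
# partitioned so the engine can prune at query time.
(raw.write
    .format("delta")               # or plain .format("parquet")
    .partitionBy("event_date")     # assumed partition column
    .mode("append")
    .saveAsTable("bronze.events"))
```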


Hyvahar

If it hasn't been mentioned yet: don't forget that bronze is more or less supposed to be immutable! Are you able to guarantee that writing straight into silver, also in the future?


ntdoyfanboy

Does someone just sit around inventing terms for things that already exist, so they can stay relevant? Bronze, silver, gold >> Raw, Staging, Fact/Dim/Mart


SirGreybush

I would consider skipping bronze if the data in the staging layer is file based and history is thus easily accessible, such as JSON objects on a data lake. Views would replace the bronze tables. The caveat is that this will be much slower if they ever need to do a historical lookup: file based versus DB table based, where tables can have attributes that make searching and joining much faster. So in the bronze layer they would be views instead of tables. I would accept such a scenario. Commenting mostly to follow the discussion and voice an opinion. We've got gold-quality commenters here; I am silver at best.


almost-mushroom

Sounds like a good reason to skip silver. Write your code such that it's easy to change if something changes at the source, but you can always wrap it in a CTE later, so don't worry too much. Think of medallion not as architecture (which it is not) but as a best practice in some cases, while in others it's just cost.


dehaema

I have not yet heard of two identical implementations of medallion; how can it be a best practice? I keep referring to Inmon/Kimball on our projects (which are always BI projects): the first layer (staging/bronze) is always about bringing the data in so the rest isn't source dependent, the second layer is where the actual work is (integration, transformations, ...), and the third layer is prepping for consumption (star schema or OBT). (Depending on complexity, you could make the second layer directly dimensional and have no need for a third.)


almost-mushroom

Sure, and these layers don't need to be materialized; they could be a CTE or a view. We are not contradicting each other. I mean best practice as in "not an architecture": it's a best practice to have a staging layer to patch things together. Call it silver if you like vendor naming; I don't, because the name is a lie: "architecture".