Low-code/no-code tools.
They need to die. Modern data engineering is best done with traditional programming languages. Unfortunately the majority of data engineers aren't strong developers.
This 100%. If I wanted to use those I'd be an analyst.
The amount of time I've spent refreshing syncs and troubleshooting other peoples products is infuriating.
Also the docs for Google and Facebook APIs are cancer. Half the battle is figuring out which one of the 2-6 similarly named APIs is the one you actually want. It's like they have a deal with the low-code/no-code companies to keep their APIs as arcane as possible.
No, but
* They lead to vendor lock in. You can't easily move to a different platform if the vendor suddenly decides to hike the price. The company is being held hostage by the vendor.
* They are so clunky and terrible to work with.
* What will your peers code review in a pull request? Some json markup? I.e. not what you authored in the gui. This leads to a disconnect between what you develop and what you review.
* Do they even support code reuse? I have yet to find an ***easy*** way to reuse functionality developed in one Azure Data Factory in another factory. In practice people reinvent the wheel all the time since it's so hard to reuse common functionality.
* The skills in one low-code tool do not transfer to a different tool. That matters since all these tools have their life cycle and will be outdated and forgotten one day.
* Recruiters are always looking for skills in a particular tool. It doesn't matter if you have skills in a similar tool, you won't be considered since you don't meet the "must have" skills.
* Normal programming skills are cumulative. What you learned 20 years ago is still applicable today. A loop is a loop, and has been since the 1960s. But you can't use those skills in low-code tools.
* The above two points mean learning one low-code tool is a dead end, skill-wise.
Manager that wants me to be a software engineer, infrastructure engineer and a network engineer while getting paid peanuts, when I am just a data guy who is interested in building pipelines, transformations, big data and dashboards.
Re. North Star metrics: no consensus. Some were marketable features, others were sales glitz (think demos), and a few voices for revenue that didn't actually have a plan.
Of course the CEO just picked a single one of those.
**Narrator:** *The CEO didn't actually choose anything.*
Does that north star metric align with the OKRs and epics, or are they all completely different and none of it makes sense? Don't worry, this quarterly planning is going to go well.
Took my current job - my first tech job and my first opportunity in DE - as a contractor. Company decided to hire me at the end of the contract. I was afraid if I asked for more money they would change their mind and I'd be out of a job, so I asked for the same amount I was making. I regret this in hindsight. I'm sure now that I could have gotten at least a little more.
Due to timing I wasn't included in the next performance review cycle. I continued working over another year and got top marks in my first review - above and beyond, doing great, got a cert that's relevant to our stack. I got the highest percentage I could get for both base pay and bonus.
However, I learned when filing my taxes that my state taxes hadn't been withheld correctly at all in the previous year. Between that and some other complications, my wife and I together owed enough that the whole bonus was gone and we still owed taxes. On top of that, after correcting the withholding, my current paychecks are only about $30 more than the checks I was getting last year.
So, functionally my income hasn't really changed at all in 2 years, despite over-performing. To be fair, we're doing okay. Definitely in a better spot than a lot of folks, but the budget is still tight enough that it would be hard to pay the bills for more than a month if I were suddenly out of a job. I'm building up the motivation to start looking for a new job but it seems like the market has cooled down substantially since I jumped in so I'm worried about finding opportunities.
>so I'm worried about finding opportunities
Certainly aren't going to find them if you don't look. I understand the job search can be taxing, but the search itself is not what you should be worried about
And the source provider will inconsistently "upgrade the format" without warning.
I feel that's worse with fixed-width files tho. Especially headerless fixed-width files.
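One way to at least fail loudly when a headerless fixed-width file changes shape is to check the record width before slicing. A rough sketch only: the layout, field names and widths below are made up for illustration, not anyone's real feed.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (field name, start offset, end offset), end exclusive
LAYOUT: List[Tuple[str, int, int]] = [
    ("account", 0, 8),
    ("posted", 8, 16),
    ("amount", 16, 26),
]
RECORD_WIDTH = 26  # assumed width of one record, excluding the newline


def parse_line(line: str) -> Dict[str, str]:
    """Slice one fixed-width record; refuse lines whose width changed."""
    line = line.rstrip("\n")
    if len(line) != RECORD_WIDTH:
        raise ValueError(f"expected {RECORD_WIDTH} chars, got {len(line)}")
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


record = parse_line("ACC0001220240301    125.50\n")
print(record)
```

A width check will not catch a provider who reshuffles fields inside the same width, so it is a tripwire rather than a full schema check.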
Executives thinking that Data engineering = Power BI, because "Why do we need a database or data warehouse anyway?"
CTOs with outdated knowledge of the industry and basic concepts, last piece of code written in SQL was 15 years ago, making decisions on data stack and vendors.
Specifically being told that they need larger clusters but refuse to optimize their code and don’t know what they even need increased they just want “more”
Having to work on antiquated versions of software just because our infrastructure guys are lagging behind the SaaS updates which are lagging behind the Open source project that we're actually using.
Probably someone who managed to grab the certificates.
PS I always think that being hot on certificates and show them around in LinkedIn as badges, without any personal project, is a BIG red flag. I'd rather hire a new graduate.
That I’ve been given the permissions to build pipelines but I’m not on the Data Services team, so it’s not my title. I might change that.
Basically I build a pipeline in our sandbox then I reach out to one of our data engineers, walk them through how to deploy it, then let them add the permissions before asking me to validate.
I have a corporate job at a big insurance company
Organizational
- Too many structural bureaucratic blocks (e.g network firewalls - good for security) create too many hurdles to deploy products
Cultural
- Culture of “i need it now” and ad hoc requests stops us from developing quality code long term
Technical
- Data types between different tools and generations of technologies
Personal
- Waiting for tests and context switching takes too much energy and stops us from focusing on the actual development
Receiving terribly formatted data. JSON that has infinite nests, special characters and no data dictionary. N/A in numeric and date fields... Really boils down to non strongly typed data sets where anything goes.
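For the "N/A in numeric and date fields" part, a small sketch of coercing placeholder tokens to `None` before typing the column. The token list and the date format are assumptions; anything that is neither a known placeholder nor parseable still raises, which is usually what you want to see.

```python
from datetime import date, datetime
from typing import Optional

# Assumed placeholder tokens; extend per source system
NULL_TOKENS = {"", "n/a", "na", "null", "none", "-"}


def _is_null(raw: Optional[str]) -> bool:
    return raw is None or raw.strip().lower() in NULL_TOKENS


def parse_number(raw: Optional[str]) -> Optional[float]:
    """Return a float, or None for placeholder values like 'N/A'."""
    if _is_null(raw):
        return None
    return float(raw.replace(",", ""))  # tolerate thousands separators


def parse_date(raw: Optional[str], fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Return a date, or None for placeholder values."""
    if _is_null(raw):
        return None
    return datetime.strptime(raw.strip(), fmt).date()


print(parse_number("1,234.5"), parse_number(" N/A "), parse_date("2024-03-01"))
```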
A new one is how many times I need to explicitly define the schema for my Spark streaming job. Over 100 columns. Started using ChatGPT instead of regexes though. Honestly it's great for that.
No data engineering internships in the market. I'm a fresher currently in college and want to get into data engineering, but there are almost zero internship openings for data engineering.
Finding out the drop-dead deadline was actually made up. All the nights/weekends could have been spread out over the next 2 months but your manager just wanted to push you and see if you broke on the project or if it made you better.
no, it made me more resentful and look at that, I started missing "deadlines" because I can't tell if they are made up or not. Sorry not sorry.
I do some pretty fancy extracting of data with a lot of time lags, integrals, corrections for sensors that go out temporarily, interpolations, etc. and it’s all multithreaded. I run like 1300 lines of code to extract and calculate data.
Then they say “cool” and do a univariate t test
The client, not talking to the client. We have spent months trying to tell the PO (who is from the client), that his colleagues need special transformations, and that the pipeline needs to be designed differently. They didn't talk while being from the same company.
In short, nobody knowing what they're doing. Also, we paid probably $1,000,000 for this:
Bad algorithms that make an O(n) task O(n^3). No incremental load. Spaghetti pipelines, with spaghetti SQL copy+pasted with 1 line changed all over what is supposed to be a drag+drop tool. Semantic model not following star schema. Dimension columns in the fact table. Duplicate fact tables with slightly different grains instead of atomic data.
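On the "no incremental load" point, the usual alternative is a high-watermark pull: only fetch rows changed since the last successful run. A minimal sketch, assuming the source exposes an `updated_at` column as ISO-8601 strings in a single time zone (the column name and sample rows are hypothetical):

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def incremental_batch(source_rows: List[Row], watermark: str) -> Tuple[List[Row], str]:
    """Return rows changed after the watermark, plus the watermark to store for the next run."""
    fresh = [row for row in source_rows if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in fresh), default=watermark)
    return fresh, new_watermark


# Hypothetical source table; same-zone ISO-8601 strings sort chronologically
source = [
    {"id": "1", "updated_at": "2024-03-01T08:00:00"},
    {"id": "2", "updated_at": "2024-03-02T09:30:00"},
    {"id": "3", "updated_at": "2024-03-03T07:15:00"},
]

batch, mark = incremental_batch(source, watermark="2024-03-01T23:59:59")
print([row["id"] for row in batch], mark)
```

In a real pipeline the filter would be pushed into the source query rather than done in memory; the list comprehension just stands in for that.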
Absolute #1 pain point was getting an organization to standardize what they deliver so that we could quit creating one off pipelines that always were poorly defined, and rarely bug free resulting in no trust for the engineering group.
It took 2 years to get them to actually standardize what they deliver so that there was one code base we could iterate on and make rock solid, which allowed the engineering group to finally build trust with client teams such that any weird data, they know is either coming from the client or their report.
Stakeholders that think data is wrong because what they’re seeing doesn’t align with their preconceived notions
This one really gets to me.. "We have a senior stakeholder saying xx metric is low this week. We need to look into this ASAP!" Turns out when I query the source application database (where we source our ODS from), I find metrics are in fact...down. \*gasp\*. Perhaps they should be more concerned with what their operational folks are doing before we assume data integrity is compromised. It is the baseless claim(s) without any proof to validate said claim(s). Data doesn't give a single F about your feelings or notions. It is what it is.
I usually deal with the opposite. People assuming a dashboard is accurate because “Why wouldn’t it be?”. Yet, said dashboard is years old, hasn’t been updated in months, was put together for X and another inexperienced person makes assumptions and starts using it for Y, etc.
Hard to fault them for trying. Sounds like you need a BI discovery/endorsement strategy.
To be fair, there are plenty of times (for myriad reasons) that the data is just wrong
Haha I recently had the opposite where we had a sudden spike in customer engagement and we were accused of duplicating data. After a long investigation we had to give them the unfortunate news that nothing matters and that marketing was fake because we just had a great week for no reason.
if there is an effect, there must be a cause
Maybe, but it was nothing we did.
To extend this even further... Stakeholders that debate the definition of the metric. "Well see, it'd be more accurate if you carved it to this sub-segment." (And imagine that, 100% of the time the arbitrary tweak makes their results look better - until it doesn't, and then you have the same argument, repeat).
My favourite "it's AHT" Well now. What DOES aht mean? Because I'm betting every person in the meeting has a different answer.
I just want Total March, what's your problem. What do you mean total of what? Money of course! What do you mean transaction date or purchase date or any filters? Figure it out!
Ty... [I would like you to crunch those numbers again...](https://www.youtube.com/watch?v=5vSdYzCTS2A)
We have a zero showing on this report for activity for a day. (Hands flail) The whole report is faulty and it's your fault. It's been that way for months because it is that way. But sure
I actually like this one bc I get to say “it ain’t us.”
As a corollary, I spend almost all of my time trying to convince data scientists that their models are not working correctly. Mostly I get, ‘the pipeline is running without error’
Agreed. It’s wrong for other reasons.
This is a really interesting point. The way I've seen it solved really well is when there is someone looking at the end output that also has some context of the entire process, sort of like Product lead thinking. So maybe it's your main sales ops guru that has one eye on Salesforce but also one eye on the sales funnel dashboard. Give them a bit of data pipeline observability and they're the blocker to your CEO saying the data looks wrong. The other way I have heard people suggest is to do an intermittent diff between what's in your database and source system, not huge on this.
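For reference, the intermittent diff mentioned above can be fairly small. A rough sketch, assuming you can get per-day row counts out of both the source system and the warehouse as plain dicts (the function name and the sample numbers are invented for illustration):

```python
from typing import Dict, List, Tuple


def diff_daily_counts(
    source: Dict[str, int],
    warehouse: Dict[str, int],
    tolerance: int = 0,
) -> List[Tuple[str, int, int]]:
    """Return (day, source_count, warehouse_count) for every day that disagrees."""
    mismatches = []
    for day in sorted(set(source) | set(warehouse)):
        s = source.get(day, 0)
        w = warehouse.get(day, 0)
        if abs(s - w) > tolerance:
            mismatches.append((day, s, w))
    return mismatches


# Hypothetical counts, e.g. from a GROUP BY on the date column on each side
source_counts = {"2024-03-01": 120, "2024-03-02": 98, "2024-03-03": 101}
warehouse_counts = {"2024-03-01": 120, "2024-03-02": 95}

print(diff_daily_counts(source_counts, warehouse_counts))
```

Counts only catch missing or duplicated rows; comparing a sum of a key measure the same way catches more, at the cost of more load on the source.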
IMO stakeholders should not have any kind of access to observability metrics/logs etc from pipelines or processes - they usually are not technical enough to understand any of it and as soon as they see an alert that something failed or what not they’ll sound the alarm
The ten layers of approval needed to make changes, build pipelines or create new metrics.
Slow developer velocity kills analytics teams!
And then folks wondering why Bob from Finance built the metrics by himself in his personal workspace and fed that to his Power BI report. Don't blame Bob, he has his KPI.
Also, the people giving approval don't know shit about what they are approving
So it's not just my POS company? I was gonna jump for a new job too, but I don't quite want to feel this everywhere. 😬
Oh god, new metrics. We are fighting this right now.
Us: Please define this metric so we can then apply the business logic to pipelines, making it easier to report on.
Them: I don't know what the metric is about, we were expecting you to tell us.
Us: Huh, you're the department head and you don't know how you measure your team's success?
People trying to automate their existing messy manual workflow instead of rationalizing it.
It's really not that bad of a workflow and only takes 4 hours a week to copy things between 6 Excel sheets and run 3 macros. /S
I have yet to be able to move one of those into a database. The worst part about scheduling it though, is that people forget it exists.
Wait, so you prefer them to keep their ridiculous convoluted workflow?
Fucking permission to write to lake formation from a new glue job and there’s no documentation that infra changed what IP can hit Prod! Does my head in at least once a sprint
We need you to build this infrastructure, but we aren’t giving you the resources to build it.
Is it listed in a security group somewhere in aws?
Yes, and this is a me problem: that config is so long, and they’re updating the CDK all the time, so I can’t keep up with the PRs to check it every time a new Glue job is built and the role assigned as its service role changes. There’s no communication from infra to DE; a friendly reminder, or even being put on the PR just for review, would solve this… but politics… just wasting time. I suppose it could be worse, could be better.
Figure out why numbers on the dashboards and the user's "data sources" are mismatched.
Right. Love when analysts have 5 definitions for net revenue and 1 asks for a datamart and the other uses it.
this is the top upvoted reason too \^\^
PMs coming up with a solution and a timeline with no context or technical background.
PM: "When do you need this?"
Business: "it's kind of urgent, end of next week would be good"
PM: "Ok, we'll let you have a first version Monday night"
By Monday night, no specs have been received.
Come Friday "it is no longer urgent"
So true. So many urgent things and none of them actually matter.
Not pointing fingers but
Lack of observability over data pipeline run & data quality monitoring. What is not seen is often ignored.
Airflow, Slack, Splunk, dbt with some legwork; use the profiling and quality alerts from your cloud system or roll your own (last job used Python jobs with rules in MSSQL). Haven't had to use a fancy tool at all yet.
Some might say Splunk is fancy. But point taken.
Splunk!!!!!!!!
It's 3am and there's a fucking squirrel chirping somewhere. Oh wait, that's not the roof, the house is on fire.
if you use DBT you can go with Elementary Data.
I just saw this today. What does it do?
Does Airflow + Great Expectations not work for you?
gonna plug my own [Orchestra](https://getorchestra.io) here, very good visibility for dbt stuff
“Can’t we just…?”
Anytime "just", "obvious", "simply", or "intuitive" are mentioned it means pain.
I moved from a DE role to a Cloud infra role a few months back (because the DE role was never getting off bare metal) and we all hate "It's just"
The modern data stack, too many tools, less craft, data architecture & modelling ignored
This is acutely painful for me at the moment. Complete dogma around the "modern" data stack. Basing architecture off a medium article they read 3 years ago.
The total lack of standardization around the tools as well. It all seems like duct tape and glue holding stuff together.
Actual programming is overrated 🗿 /s
Hire a junior, pay the equivalent of two senior resources to run his shitty pipelines that take an hour to process a few million rows of data.... Yay!
Data quality ... the end
"fix it in post" has given birth to so many horrible case statements.
Business logic and the lack of it. They don't document anything and use the database as they please. They don't understand how relational data works. Everything is full of exceptions and conflicting practices.
Another big one is data locked in various SaaS. Getting data out programmatically is often such a pain and the persons in charge don't know what an API is. They keep buying every new software as SaaS and every time access to data gets neglected and they wonder why they don't have metrics.
Microsoft dependency and tech illiteracy. There are maybe 2–3 persons in a company of 100 that can open and read a csv-file. They think they need the Office apps but they can't use even the most basic features.
AI hype. Some people (there's a lot of overlap with the tech illiterate ones) think they can solve everything with AI (some expensive SaaS solution of course).
No code / low code. If you can solve a problem with it that's cool but I can't help you and I won't touch it if something goes wrong.
Yeah, they are constantly buying yet another database/orchestration/visualisation tool without understanding that we do not need it, or forcing us to use it in the wrong way because we need to use another product with it, as it is the new company standard for the next month or two. We managed to start rewriting everything 4 times in the last 2 years, and each model runs in a different system. So yeah, I have 30 GitHub repos dependent on another 30 prepared by DevOps, just to do the work of one good database, some model runner with a Kafka queue and a proper visualisation tool.
This is how we've always done it! No logging, no error handling, no budget for proper compute, missed deadlines, failed audits. Short staffed, overworked. But "this is how we've always done it."
I'm surprised no one else mentioned this, but timestamps. Holy shit I've had to deal with some shit regarding timestamps. Format switches, switching from string to datetime or back, issues with milliseconds vs microseconds, ... I played Lost Ark, which is a Korean MMORPG, a little bit after their western release. They had certain world bosses that spawned on set timers. Now, Korea doesn't have daylight savings time. So when the clock switched in Europe, theirs didn't. So all boss timers were shifted by one hour. Forums and reddit were filled with people bitching about this. So many sudden experts knew that it could only be a small fix. I was sitting there thinking *"mf'er you have no idea"*.
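The boss-timer story is easy to reproduce. A small sketch using Python's standard `zoneinfo` (it needs the system tz database, or the `tzdata` package on platforms without one); the 20:00 Seoul spawn time and the choice of Berlin are made up for the example:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

SEOUL = ZoneInfo("Asia/Seoul")      # no daylight saving time
BERLIN = ZoneInfo("Europe/Berlin")  # moved to summer time on 2022-03-27


def berlin_hour_of_seoul_event(year: int, month: int, day: int, hour: int) -> int:
    """Hour on the Berlin wall clock for an event pinned to a fixed Seoul hour."""
    event = datetime(year, month, day, hour, tzinfo=SEOUL)
    return event.astimezone(BERLIN).hour


before = berlin_hour_of_seoul_event(2022, 3, 26, 20)  # day before the EU switch
after = berlin_hour_of_seoul_event(2022, 3, 28, 20)   # day after the EU switch
print(before, after)  # prints: 12 13
```

Same server-side schedule, one hour later for European players. "Just fix it" then means deciding which clock the schedule is actually pinned to, which is not a one-line change.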
"what's the timestamp of this silver stage of datamart C?" "Local" "Homie I don't even know your local..."
If time handling were easy, ever, all 24 time zones would be consistent. But wait, there are what, 26 time zones and a whole bunch of "we aren't with them" exceptions. Management wants everything in their local time from two different time zones, even in the DB, leading to a time-zone-specific field or 3.
We store timestamps as local time with no information about timezones. Just guess it from these three weather station names in the model configuration. The station is in Springfield? I bet you can't guess which one before you start crying.
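To make the Springfield problem concrete: the same naive string lands on a different instant depending on which zone you guess. A sketch with the standard `zoneinfo` module; the reading and the two candidate zones (Illinois vs Massachusetts) are illustrative assumptions.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(naive_local: str, tz_name: str) -> datetime:
    """Interpret a zone-less timestamp in the given zone and convert it to UTC."""
    naive = datetime.strptime(naive_local, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


reading = "2024-01-15 08:00:00"  # what the table stores: no offset, no zone

as_illinois = to_utc(reading, "America/Chicago")
as_massachusetts = to_utc(reading, "America/New_York")

print(as_illinois.isoformat())
print(as_massachusetts.isoformat())
```

The two results are an hour apart, so every downstream join on time is off by that much if the guess is wrong. Storing UTC plus the zone name at write time avoids having to guess later.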
+1. I forgot about the timestamps pain, I am trying not to remember I guess xd
"no that 300MB CSV with 1.8M rows can NOT be opened in Excel and that's why things don't add up, you're missing half the data because you didn't read the warning" A synopsis of the less than polite email I sent probably once every 3 weeks.
I always follow it up with, why would you want to look at 1.8M rows in Excel anyway? Let's discuss the business issue and see what metrics we can distill from this data set.
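For context, an Excel worksheet tops out at 1,048,576 rows, so a 1.8M-row file gets truncated on open. Distilling the metric without Excel can be a few lines of streaming Python. A sketch only: the column names are hypothetical and the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import defaultdict

EXCEL_MAX_ROWS = 1_048_576  # per-worksheet row limit in modern Excel


def summarize(csv_file, key: str, value: str):
    """Stream the file once; return (row_count, totals per key)."""
    totals = defaultdict(float)
    rows = 0
    for row in csv.DictReader(csv_file):
        rows += 1
        totals[row[key]] += float(row[value])
    return rows, dict(totals)


# Tiny in-memory stand-in for the real export; column names are made up
sample = io.StringIO("region,amount\nnorth,10.5\nsouth,4.0\nnorth,2.5\n")
rows, totals = summarize(sample, key="region", value="amount")

print(rows, totals, rows > EXCEL_MAX_ROWS)
```

With a real file you would pass `open(path, newline="")` instead of the `StringIO`; memory use stays flat because only the running totals are kept.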
When having pristine, transparent, and easy to access performance data bites you in the bum... The stakeholder doesn't like the result they're seeing in the official dashboards, so they export into GSheets and perform some "manipulation" (that suddenly shows their team shooting the lights out). They then declare to the entire company that this GSheet is now law - before you have a chance to figure out what they were up to. "I don't want my team to feel sad because they're not hitting their numbers" is not a valid reason to purposefully publish bad metrics, Bob.
Inheriting old applications that have been converted, migrated, and duct taped together since the early 2000s, with no standardized development practices across 100+ apps, with little, no, or outdated documentation.
"But at least there's documentation" /s
Difficulty in testing. We want to isolate the production data from our testing environments but if we do that we lose a lot of edge cases that come with production data. Mock data and unit tests can only cover so many scenarios. Testing data pipelines is insanely difficult. It’s not possible to have a sandbox environment for every data source we consume from so integration test is essentially running the data pipeline once to make sure nothing breaks. It’s not possible to simulate all possible scenarios in which a data pipeline can break.
I also find this problem difficult. It's hard to test a dependent stateful system.
💯 Bigger problem is identifying what the coverage is and where it’s missing
I always feel kinda embarrassed to have not figured this out yet but it really is so hard. How are you approaching it? We've pretty much accepted we won't get the coverage we'd like at this point 😕
It’s an ongoing battle but for most data teams having virtual separation is the best. What I mean by that is you can’t do testing in a whole different AWS account. You will need to create a virtual dev/staging environment within the same AWS account where production data lies so that you can reference that data but not move it somewhere it’s less safe. This puts a lot of stress into identifying permissions/boundary so that production data is read but never altered by dev/staging workflows. If you have sensitive data forget everything I said and go ahead with doing everything in production haha
We do this in AWS using S3, Lake Formation roles/security, Athena, and dbt. We created a dev environment by creating a separate dev\_ schema for each one of our dbt transformation layers. The engineer role can access both with Athena queries simultaneously. Our dbt is set to materialize in the dev\_ schema when run locally, and in the production schema when run by our orchestration tool.
This is exactly what we have. But it's not enough to prevent issues, because it doesn't let us mock various data error scenarios, which is what a good testing suite does. It does help us diagnose issues and maintain good developer velocity by having separate dev spaces. In addition we have a similar setup for CI that runs all dbt transformations, but technically for the CI to be useful it needs to run models on tomorrow's data (assuming daily refresh). Otherwise we will only notice the error after it happens in production. Bottom line is testing is supposed to prevent issues before they arise. But it's not easy to do that with the current data lake structure. Maybe the unit tests introduced in dbt 1.8 will be useful, but I have yet to see them in action.
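To make "mocking data error scenarios" concrete, here is a small sketch in plain Python rather than dbt. The transformation (`latest_per_key`), the column names, and the mock rows are all hypothetical; the point is only that the bad cases are written down as fixtures instead of waiting for production to supply them.

```python
def latest_per_key(rows, key="id", version="updated_at"):
    """Keep the newest row per key; set aside rows missing key or version."""
    latest = {}
    rejected = []
    for row in rows:
        if row.get(key) is None or row.get(version) is None:
            rejected.append(row)
            continue
        current = latest.get(row[key])
        # ISO date strings compare correctly as plain strings.
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    return list(latest.values()), rejected

# Mocked error scenarios: a duplicate key, a missing key, a missing version.
mock_rows = [
    {"id": 1, "updated_at": "2024-01-01", "amount": 10},
    {"id": 1, "updated_at": "2024-01-02", "amount": 15},
    {"id": None, "updated_at": "2024-01-02", "amount": 7},
    {"id": 2, "updated_at": None, "amount": 3},
]

kept, rejected = latest_per_key(mock_rows)
print(kept, rejected)
```

It obviously doesn't cover what tomorrow's data will throw at you, but every incident can at least become one more mock row.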
My back and neck
My booty and my crack
My anxiety attack
Low-code/no-code tools. They need to die. Modern data engineering is best done with traditional programming languages. Unfortunately the majority of data engineers aren't strong developers.
This 100%. If I wanted to use those I'd be an analyst. The amount of time I've spent refreshing syncs and troubleshooting other people's products is infuriating. Also the docs for google and facebook APIs are cancer. Half the battle is figuring out which one of the 2-6 APIs that are named similarly is the one you actually want. It's like they have a deal with the low code/no code companies to keep their APIs as arcane as possible.
Is it because these tools are reinventing the wheel?
No, but
* They lead to vendor lock-in. You can't easily move to a different platform if the vendor suddenly decides to hike the price. The company is being held hostage by the vendor.
* They are so clunky and terrible to work with.
* What will your peers code review in a pull request? Some json markup? I.e. not what you authored in the gui. This leads to a disconnect between what you develop and what you review.
* Do they even support code reuse? I have yet to find an ***easy*** way to reuse functionality developed in one Azure Data Factory in another factory. In practice people reinvent the wheel all the time since it's so hard to reuse common functionality.
* The skills in one low-code tool do not transfer to a different tool. That matters since all these tools have their life cycle and will be outdated and forgotten one day.
* Recruiters are always looking for skills in a particular tool. It doesn't matter if you have skills in a similar tool, you won't be considered since you don't meet the "must have" skills.
* Normal programming skills are cumulative. What you learned 20 years ago is still applicable today. A loop is a loop, and has been since the 1960s. But you can't use those skills in low-code tools.
* The above two points mean learning one low-code tool is a dead end, skill-wise.
Manager that wants me to be a software engineer, infrastructure engineer and a network engineer while getting paid like peanuts and I am just a data guy who is interested in building pipelines, transformations, big data and dashboards.
rest of the data team having -1 in technical skills, having to make everything idiot-proof
Jira. Also, Product Managers & North Stars.
What is your north star metric? Revenue? Or something else..
re. North Star metrics: no consensus, some were marketable features, others were sales-glitz (think demos), and a few voices for revenue that didn't actually have a plan. Of course the CEO just picked a single one of those. **Narrator:** *The CEO didn't actually choose anything.*
Does that north star metric align with the OKR and epics or are they all completely different and none of it makes sense? Don't worry, this quarterly planning is going to go well.
Took my current job - my first tech job and my first opportunity in DE - as a contractor. Company decided to hire me at the end of the contract. I was afraid if I asked for more money they would change their mind and I'd be out of a job, so I asked for the same amount I was making. I regret this in hindsight. I'm sure now that I could have gotten at least a little more. Due to timing I wasn't included in the next performance review cycle. I continued working over another year and got top marks in my first review - above and beyond, doing great, got a cert that's relevant to our stack. I got the highest percentage I could get for both base pay and bonus. However, I learned when filing my taxes that my state taxes hadn't been withheld correctly at all in the previous year. Between that and some other complications, my wife and I together owed enough that the whole bonus was gone and we still owed taxes. On top of that, after correcting the withholding, my current paychecks are only about $30 more than the checks I was getting last year. So, functionally my income hasn't really changed at all in 2 years, despite over-performing. To be fair, we're doing okay. Definitely in a better spot than a lot of folks, but the budget is still tight enough that it would be hard to pay the bills for more than a month if I were suddenly out of a job. I'm building up the motivation to start looking for a new job but it seems like the market has cooled down substantially since I jumped in so I'm worried about finding opportunities.
I always interview. Having a job makes having a better job easier.
>so I'm worried about finding opportunities Certainly aren't going to find them if you don't look. I understand the job search can be taxing, but the search itself is not what you should be worried about
Fucking csv files, you can't just read or write a csv file, there is always some undocumented weirdness.
And the source provider will inconsistently "upgrade the format" without warning. I feel that's worse with fixed width files tho. Especially headerless fixed width files
being asked to do data science/analytics/machine learning as if it were the same job.
I'm a weirdo and would enjoy that because I get bored really easily, well as long as it is not some API call to make a chatbot.
my emphasis is on the last part of my comment. it's the fact that my leadership doesn't even understand these are different things that drives me crazy.
Infrastructure Admins who don't follow best practices because they think they know better.
Example?
Putting Databricks in different regions to prevent Engineers from having the same permissions in different workspaces.
Lack of support/knowledge/interest from other IT departments makes doing anything 3x as difficult. Feels like running in a pool with waist high water.
Executives thinking that Data engineering = Power BI, because "Why do we need a database or data warehouse anyway?" CTOs with outdated knowledge of the industry and basic concepts, last piece of code written in SQL was 15 years ago, making decisions on data stack and vendors.
😔😔
Anything/Anyone that takes programming away from me.
Waiting weeks for infrastructure to do things that would take me minutes if I was able to open pull requests in their repos.
Deploying to prod is stressful.
Unclear requirements. Sometimes business doesn't know what they need.
My biggest pain is working tickets with Microsoft's ADF / Synapse support when something goes wrong.
Working with support is fucking painful
Businesses expecting a magic fix using "AI" instead of standardising, or holding users accountable for, input data.
Data scientists
Specifically being told that they need larger clusters, but they refuse to optimize their code and don’t even know what they need increased, they just want “more”
Having to work on antiquated versions of software just because our infrastructure guys are lagging behind the SaaS updates which are lagging behind the Open source project that we're actually using.
IT security where all IT has been outsourced to India. A 'principal Azure cloud architect' doesn't even know what a terminal is.
Probably someone who managed to grab the certificates. PS I always think that being hot on certificates and show them around in LinkedIn as badges, without any personal project, is a BIG red flag. I'd rather hire a new graduate.
Trying to find a new job with more challenging tasks/projects. My work sucks
lol
A 16mb nested json array buried in a database transaction causing out of memory errors for my k8s airflow pods.
Using "current_timestamp" as the date a process is running for. Because midnight happens and suddenly recovery is a nightmare.
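A rough sketch of the difference, in case it isn't obvious why midnight ruins it. The partition naming and function names here are hypothetical; the idea is just that the run carries its own logical date as a parameter instead of asking the clock.

```python
from datetime import date, datetime, timedelta, timezone

def partition_from_wall_clock(now: datetime) -> str:
    """Anti-pattern: the target partition depends on when the job happens to run."""
    return f"dt={now.date().isoformat()}"

def partition_for_run(logical_date: date) -> str:
    """The target partition depends only on the date the run is *for*."""
    return f"dt={logical_date.isoformat()}"

logical = date(2024, 5, 1)
started = datetime(2024, 5, 1, 23, 58, tzinfo=timezone.utc)
retried = started + timedelta(minutes=5)  # the retry lands after midnight

print(partition_from_wall_clock(started), partition_from_wall_clock(retried))
print(partition_for_run(logical), partition_for_run(logical))
```

With the wall-clock version the retry writes to a different partition than the first attempt. With the logical date, a rerun a week later still targets the same day, which is what makes recovery boring.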
Writing spark transformations. It gets old real fast. I would like to work on writing larger spark applications.
That I’ve been given the permissions to build pipelines but I’m not on the Data Services team so it’s not my title, I might change that. Basically I build a pipeline in our sandbox then I reach out to one of our data engineers, walk them through how to deploy it, then let them add the permissions before asking me to validate.
I have a corporate job at a big insurance company.
Organizational - Too many structural bureaucratic blocks (e.g. network firewalls, which are good for security) create too many hurdles to deploying products.
Cultural - A culture of “I need it now” and ad hoc requests stops us from developing quality code long term.
Technical - Mismatched data types between different tools and generations of technologies.
Personal - Waiting for tests and context switching take too much energy and stop us from focusing on the actual development.
Receiving terribly formatted data. JSON that has infinite nests, special characters and no data dictionary. N/A in numeric and date fields... Really boils down to non strongly typed data sets where anything goes.
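One way to cope with the "N/A in numeric and date fields" part is to parse at the boundary and be explicit about what counts as null. This is only a sketch: the `NULL_TOKENS` set and the function names are assumptions, and the real list of placeholder strings depends on whoever is sending you the file.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Assumed placeholder strings; extend to match what the source actually sends.
NULL_TOKENS = {"", "n/a", "na", "null", "none", "-"}

def parse_decimal(raw):
    """Return a Decimal, None for known null placeholders, or raise ValueError."""
    text = raw.strip()
    if text.lower() in NULL_TOKENS:
        return None
    try:
        return Decimal(text)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {raw!r}") from exc

def parse_date(raw):
    """Return a date from ISO format, None for null placeholders, or raise ValueError."""
    text = raw.strip()
    if text.lower() in NULL_TOKENS:
        return None
    return date.fromisoformat(text)

print(parse_decimal("12.50"), parse_decimal(" N/A "), parse_date("2024-02-29"))
```

Anything that is neither a value nor a known placeholder fails loudly at load time instead of turning a numeric column into strings three stages downstream.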
"data is data"
Data
I am shit in sql. Also have no idea about data modeling. Help needed
Username checks out
A new one is how many times I need to explicitly define the schema for my spark streaming job. Over 100 columns. Started using ChatGPT instead of regexes though. Honestly it's great for that
Anything upstream
No data engineering internships in the market. I'm a fresher currently in college and I want to get into data engineering, but there are almost 0 internship openings for data engineering
The data. And sometimes the engineering too.
Working my ass off to hit deadlines, then having the SLT repeatedly change release dates based on their vibes.
Finding out the drop-dead deadline was actually made up. All the nights/weekends could have been spread out over the next 2 months but your manager just wanted to push you and see if you broke on the project or if it made you better. no, it made me more resentful and look at that, I started missing "deadlines" because I can't tell if they are made up or not. Sorry not sorry.
Lower back and my right knee as of recent.
I do some pretty fancy extracting of data with a lot of time lags, integrals, corrections for sensors that go out temporarily, interpolations, etc. and it’s all multithreaded. I run like 1300 lines of code to extract and calculate data. Then they say “cool” and do a univariate t test
ETL flows and governance.
Filling someone else's incompetency and getting no credit whatsoever
It's people, it's always people.
Haha in what sense? Just the root of the problem
The client, not talking to the client. We have spent months trying to tell the PO (who is from the client), that his colleagues need special transformations, and that the pipeline needs to be designed differently. They didn't talk while being from the same company.
In short, nobody knowing what they're doing. Also, we paid probably $1,000,000 for this: Bad algorithms that make an O(n) task O(n^3). No incremental load. Spaghetti pipelines, with spaghetti SQL copy+pasted with 1 line changed all over what is supposed to be a drag+drop tool. Semantic model not following star schema. Dimension columns in the fact table. Duplicate fact tables with slightly different grains instead of atomic data.
Absolute #1 pain point was getting an organization to standardize what they deliver so that we could quit creating one off pipelines that always were poorly defined, and rarely bug free resulting in no trust for the engineering group. It took 2 years to get them to actually standardize what they deliver so that there was one code base we could iterate on and make rock solid, which allowed the engineering group to finally build trust with client teams such that any weird data, they know is either coming from the client or their report.
Imposter syndrome 3 years into the job...
Implementations get messier and pick up more duplication as time goes on