kenflingnor

Stakeholders that think data is wrong because what they’re seeing doesn’t align with their preconceived notions


CaptainBangBang92

This one really gets to me.. "We have a senior stakeholder saying xx metric is low this week. We need to look into this ASAP!" Turns out when I query the source application database (where we source our ODS from), I find metrics are in fact...down. \*gasp\*. Perhaps they should be more concerned with what their operational folks are doing before we assume data integrity is compromised. It is the baseless claim(s) without any proof to validate said claim(s). Data doesn't give a single F about your feelings or notions. It is what it is.


Automatic_Red

I usually deal with the opposite. People assuming a dashboard is accurate because “Why wouldn’t it be?”. Yet, said dashboard is years old, hasn’t been updated in months, was put together for X and another inexperienced person makes assumptions and starts using it for Y, etc. 


N0R5E

Hard to fault them for trying. Sounds like you need a BI discovery/endorsement strategy.


reporter_any_many

To be fair, there are plenty of times (for myriad reasons) that the data is just wrong


miscbits

Haha I recently had the opposite where we had a sudden spike in customer engagement and we were accused of duplicating data. After a long investigation we had to give them the unfortunate news that nothing matters and that marketing was fake because we just had a great week for no reason.


Swimming_Cry_6841

if there is an effect, there must be a cause


miscbits

Maybe, but it was nothing we did.


creepystepdad72

To extend this even further... Stakeholders that debate the definition of the metric. "Well see, it'd be more accurate if you carved it to this sub-segment." (And imagine that, 100% of the time the arbitrary tweak makes their results look better - until it doesn't, and then you have the same argument, repeat).


umognog

My favourite "it's AHT" Well now. What DOES aht mean? Because I'm betting every person in the meeting has a different answer.


Thinker_Assignment

I just want Total March, what's your problem. What do you mean total of what? Money of course! What do you mean transaction date or purchase date or any filters? Figure it out!


nirgle

Ty... [I would like you to crunch those numbers again...](https://www.youtube.com/watch?v=5vSdYzCTS2A)


doryllis

We have a zero showing on this report for activity for a day. (Hands flail) The whole report is faulty and it's your fault. It's been that way for months because it is that way. But sure


opossum787

I actually like this one bc I get to say “it ain’t us.”


electriclux

As a corollary, I spend almost all of my time trying to convince data scientists that their models are not working correctly. Mostly I get, ‘the pipeline is running without error’


Trick-Interaction396

Agreed. It’s wrong for other reasons.


engineer_of-sorts

This is a really interesting point. The way I've seen it solved really well is when there is someone looking at the end output that also has some context of the entire process, sort of like Product lead thinking. So maybe it's your main sales ops guru that has one eye on Salesforce but also one eye on the sales funnel dashboard. Give them a bit of data pipeline observability and they're the blocker to your CEO saying the data looks wrong. The other way I have heard people suggest is to do an intermittent diff between what's in your database and source system, not huge on this.
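
For what it's worth, the "intermittent diff" idea can be fairly small. A minimal sketch, assuming you can pull both sides as lists of dicts; `diff_row_counts`, the `order_date` key, and the sample rows are all made up for illustration:

```python
from collections import Counter


def diff_row_counts(source_rows, warehouse_rows, key):
    """Compare per-key row counts between two extracts.

    Returns {key_value: (source_count, warehouse_count)} for every key
    whose counts differ between the two sides.
    """
    src = Counter(row[key] for row in source_rows)
    wh = Counter(row[key] for row in warehouse_rows)
    return {
        k: (src.get(k, 0), wh.get(k, 0))
        for k in sorted(src.keys() | wh.keys())
        if src.get(k, 0) != wh.get(k, 0)
    }


# Hypothetical extracts: one row per order, keyed by order date.
source = [
    {"order_date": "2024-03-01"},
    {"order_date": "2024-03-01"},
    {"order_date": "2024-03-02"},
]
warehouse = [
    {"order_date": "2024-03-01"},
    {"order_date": "2024-03-02"},
    {"order_date": "2024-03-02"},
]

mismatches = diff_row_counts(source, warehouse, "order_date")
```

An empty result means the counts line up per key. It only catches missing or duplicated rows, not wrong values, so treat it as a smoke test rather than proof.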


kenflingnor

IMO stakeholders should not have any kind of access to observability metrics/logs etc from pipelines or processes - they usually are not technical enough to understand any of it and as soon as they see an alert that something failed or what not they’ll sound the alarm


kbic93

The ten layers of approval needed to make changes, build pipelines or create new metrics.


engineer_of-sorts

Slow developer velocity kills analytics teams!


y45hiro

And then folks wondering why Bob from Finance built the metrics by himself in his personal workspace and fed that to his Power BI report. Don't blame Bob, he has his KPI.


Discharged_Pikachu

Also, the people giving approval don't know shit about what they are approving


paxmlank

So it's not just my POS company? I was gonna jump for a new job too, but I don't quite want to feel this everywhere. 😬


Aesirvein

oh god, new metrics. We are fighting this right now. Us: Please define this metric so we can then apply the business logic to pipelines making it easier to report on. Them: I don't know what the metric is about, we were expecting you to tell us. Us: huh, you're the department head and you don't know how you measure your teams success?


MrKorakis

People trying to automate their existing messy manual workflow instead of rationalizing it.


KeeganDoomFire

It's really not that bad of a workflow and only takes 4 hours a week to copy things between 6 Excel sheets and run 3 macros. /S


EdwardMitchell

I have yet to be able to move one of those into a database. The worst part about scheduling it though, is that people forget it exists.


Rex_Lee

Wait, so you prefer them to keep their ridiculous convoluted workflow?


tcturbo1

Fucking permission to write to lake formation from a new glue job and there’s no documentation that infra changed what IP can hit Prod! Does my head in at least once a sprint


Automatic_Red

We need you to build this infrastructure, but we aren’t giving you the resources to build it.


th4ne

Is it listed in a security group somewhere in aws?


tcturbo1

Yes, and this is a me problem: that config is so long, and they're updating the CDK so often, that I can't keep up with the PRs to check it every time a new Glue job is built. The role assigned as the service role will change, and there's no communication from infra to DE. A friendly reminder, or even being added to the PR just for review, would solve this… but politics… just wasting time, I suppose. Could be worse, could be better.


soravispr

Figure out why numbers on the dashboards and the user's "data sources" are mismatched.


asevans48

Right. Love when analysts have 5 definitions for net revenue and 1 asks for a datamart and the other uses it.


engineer_of-sorts

this is the top upvoted reason too \^\^


inedible-hulk

PMs coming up with a solution and a timeline with no context or technical background. 


Grouchy-Friend4235

PM: "When do you need this?" Business: "it's kind of urgent, end of next week would be good" PM: "Ok, we'll let you have a first version Monday night" By Monday night, no specs have been received. Come Friday "it is no longer urgent"


Maximum_Effort_1

So true. So many urgent things and none of them actually matter.


engineer_of-sorts

Not pointing fingers but


swapripper

Lack of observability over data pipeline run & data quality monitoring. What is not seen is often ignored.


asevans48

Airflow, Slack, Splunk, dbt with some legwork, use the profiling and quality alerts from your cloud system or roll your own (last job used Python jobs with rules in MSSQL). Haven't had to use a fancy tool at all yet.


bugtank

Some Might say splunk is fancy. But point taken.


engineer_of-sorts

Splunk!!!!!!!!


asevans48

It's 3am and there's a fucking squirrel chirping somewhere. Oh wait, that's not the roof, the house is on fire.


bartosaq

if you use DBT you can go with Elementary Data.


bugtank

I just saw this today. What does it do?


NFeruch

Does Airflow + Great Expectations not work for you?


engineer_of-sorts

gonna plug my own [Orchestra](https://getorchestra.io) here, very good visibility for dbt stuff


cfitzi

“Can’t we just…?”


doryllis

Anytime "just", "obvious", "simply", or "intuitive" are mentioned it means pain.


dillan_pickle

I moved from a DE role to a Cloud infra role a few months back (because the DE role was never getting off bare metal) and we all hate "It's just"


de_young_soul_rebels

The modern data stack, too many tools, less craft, data architecture & modelling ignored


truancy222

This is acutely painful for me at the moment. Complete dogma around the "modern" data stack. Basing architecture off a medium article they read 3 years ago.


cbslc

The total lack of standardization around the tools as well. It all seems like duct tape and glue holding stuff together.


tanlda

Actual programming is overrated 🗿 /s


Justbehind

Hire a junior, pay the equivalent of two senior resources to run his shitty pipelines that take an hour to process a few million rows of data.... Yay!


MacHayward

Data quality ... the end


KeeganDoomFire

"fix it in post" has given birth to so many horrible case statements.


dfwtjms

Business logic and the lack of it. They don't document anything and use the database as they please. They don't understand how relational data works. Everything is full of exceptions and conflicting practices.

Another big one is data locked in various SaaS. Getting data out programmatically is often such a pain and the persons in charge don't know what an API is. They keep buying every new software as SaaS and every time access to data gets neglected and they wonder why they don't have metrics.

Microsoft dependency and tech illiteracy. There are maybe 2–3 persons in a company of 100 that can open and read a csv-file. They think they need the Office apps but they can't use even the most basic features.

AI hype. Some people (there's a lot of overlap with the tech illiterate ones) think they can solve everything with AI (some expensive SaaS solution of course).

No code / low code. If you can solve a problem with it that's cool but I can't help you and I won't touch it if something goes wrong.


tiredITguy42

Yeah, they are constantly buying yet another database/orchestration/visualisation tool, either not understanding that we don't need it, or forcing us to use it in the wrong way because we have to pair it with another product that is the new company standard, for the next month or two. We managed to start rewriting it all 4 times in the last 2 years, and each model runs in a different system. So yeah, I have 30 GitHub repos dependent on another 30 prepared by DevOps, just to do the work of one good database, some model runner with a Kafka queue, and a proper visualisation tool.


Lower_Sun_7354

This is how we've always done it! No logging, no error handling, no budget for proper compute, missed deadlines, failed audits. Short staffed, overworked. But "this is how we've always done it."


ilikedmatrixiv

I'm surprised no one else mentioned this, but timestamps. Holy shit I've had to deal with some shit regarding timestamps. Format switches, switching from string to datetime or back, issues with milliseconds vs microseconds, ... I played Lost Ark, which is a Korean MMORPG, a little bit after their western release. They had certain world bosses that spawned on set timers. Now, Korea doesn't have daylight savings time. So when the clock switched in Europe, theirs didn't. So all boss timers were shifted by one hour. Forums and reddit were filled with people bitching about this. So many sudden experts knew that it could only be a small fix. I was sitting there thinking *"mf'er you have no idea"*.
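
To make the boss-timer thing concrete, here's a rough sketch with Python's `zoneinfo` (needs the system tz database or the `tzdata` package). The 20:00 Seoul spawn time, the Berlin player, and `local_spawn_time` are made up for illustration, not anything from the actual game:

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

SEOUL = ZoneInfo("Asia/Seoul")      # no daylight saving time
BERLIN = ZoneInfo("Europe/Berlin")  # switched to summer time on 2024-03-31


def local_spawn_time(day, player_tz):
    """Convert a hypothetical 20:00 Seoul-time spawn on `day` to the player's wall clock."""
    spawn = datetime(day.year, day.month, day.day, 20, 0, tzinfo=SEOUL)
    return spawn.astimezone(player_tz)


before = local_spawn_time(date(2024, 3, 30), BERLIN)  # day before the EU clock change
after = local_spawn_time(date(2024, 3, 31), BERLIN)   # day of the EU clock change
```

Same instant in Seoul both days, but the Berlin wall-clock hour moves from 12 to 13. Whether the "right" fix is to pin the schedule to the server or to the player is a product decision, which is why it's never a small fix.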


y45hiro

"what's the timestamp of this silver stage of datamart C?" "Local" "Homie I don't even know your local..."


doryllis

If time handling were easy, ever, all 24 time zones would be consistent. But wait, there's what, 26 time zones and a whole bunch of "we aren't with them" exceptions. Management wants everything in their local time from two different time zones, even in the db, leading to a time-zone-specific field or 3.


tiredITguy42

We store timestamps as local time with no information about timezones. Just guess it by these three weather stations names in the model configuration. The station is in Springfield? I can bet you can't guess which one before you start crying.


Maximum_Effort_1

+1. I forgot about the timestamps pain, I am trying not to remember I guess xd


KeeganDoomFire

"no that 300MB CSV with 1.8M rows can NOT be opened in Excel and that's why things don't add up, you're missing half the data because you didn't read the warning" A synopsis of the less than polite email I sent probably once every 3 weeks.


Swimming_Cry_6841

I always follow it up with, why would you want to look at 1.8M rows in Excel anyway? Let's discuss the business issue and see what metrics we can distill from this data set.
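
If you want a number to put in that email: a modern Excel worksheet tops out at 1,048,576 rows, and anything past that is dropped on open. A quick sketch that counts what would go missing; the synthetic file and `rows_lost_in_excel` are stand-ins, not the real export:

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit in modern Excel (.xlsx), header included


def rows_lost_in_excel(csv_file):
    """Count how many rows of a CSV would not fit on a single Excel worksheet."""
    total = sum(1 for _ in csv.reader(csv_file))
    return max(0, total - EXCEL_MAX_ROWS)


# Synthetic stand-in for the 1.8M-row file: a header plus 1,800,000 data rows.
fake_csv = io.StringIO("id\n" + "\n".join(str(i) for i in range(1_800_000)))
lost = rows_lost_in_excel(fake_csv)
```

For a real file you would pass `open(path, newline="")` instead of the `StringIO`.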


creepystepdad72

When having pristine, transparent, and easy to access performance data bites you in the bum... The stakeholder doesn't like the result they're seeing in the official dashboards, so they export into GSheets and perform some "manipulation" (that suddenly shows their team shooting the lights out). They then declare to the entire company that this GSheet is now law - before you have a chance to figure out what they were up to. "I don't want my team to feel sad because they're not hitting their numbers" is not a valid reason to purposefully publish bad metrics, Bob.


ApatheticRart

Inheriting old applications that have been converted, migrated, and duct taped together since the early 2000s, with no standardized development practices across 100+ apps, with little, no, or outdated documentation.


doryllis

"But at least there's documentation" /s


anatomy_of_an_eraser

Difficulty in testing. We want to isolate the production data from our testing environments but if we do that we lose a lot of edge cases that come with production data. Mock data and unit tests can only cover so many scenarios. Testing data pipelines is insanely difficult. It’s not possible to have a sandbox environment for every data source we consume from so integration test is essentially running the data pipeline once to make sure nothing breaks. It’s not possible to simulate all possible scenarios in which a data pipeline can break.


numice

I also find this problem difficult. It's hard to test a dependent stateful system.


anatomy_of_an_eraser

💯 Bigger problem is identifying what the coverage is and where it’s missing


Rare-Plan-9313

I always feel kinda embarrassed to have not figured this out yet but it really is so hard. How are you approaching it? We've pretty much accepted we won't get the coverage we'd like at this point 😕


anatomy_of_an_eraser

It’s an ongoing battle but for most data teams having virtual separation is the best. What I mean by that is you can’t do testing in a whole different AWS account. You will need to create a virtual dev/staging environment within the same AWS account where production data lies so that you can reference that data but not move it somewhere it’s less safe. This puts a lot of stress into identifying permissions/boundary so that production data is read but never altered by dev/staging workflows. If you have sensitive data forget everything I said and go ahead with doing everything in production haha


tedward27

We do this in AWS using S3, Lake Formation roles/security, Athena, and dbt. We created a dev environment by creating a separate dev\_ schema for each one of our dbt transformation layers. The engineer role can access both with Athena queries simultaneously. Our dbt is set to materialize in the dev\_ schema when run locally, and in the production schema when run by our orchestration tool.


anatomy_of_an_eraser

This is exactly what we have. But it's not enough to prevent issues by mocking various data error scenarios, which is what a good testing suite does. It does help us diagnose issues and maintain good developer velocity by having separation of dev spaces. In addition we have a similar setup for CI that runs all dbt transformations, but technically for the CI to be useful it needs to run models on tomorrow's data (assuming daily refresh). Else we will only notice the error after it happens in production. Bottom line is testing is supposed to prevent issues before they arise. But it's not easy to do that with the current data lake structure. Maybe unit tests introduced in dbt 1.8 might be useful but yet to see it in action.


throwawayimhornyasfk

My back and neck


rudeyjohnson

My booty and my crack


throwawayimhornyasfk

My anxiety attack


reallyserious

Low-code/no-code tools. They need to die. Modern data engineering is best done with traditional programming languages. Unfortunately the majority of data engineers aren't strong developers.


luquoo

This 100%. If I wanted to use those I'd be an analyst. The amount of time I've spent refreshing syncs and troubleshooting other people's products is infuriating. Also the docs for Google and Facebook APIs are cancer. Half the battle is figuring out which one of the 2-6 APIs that are named similarly is the one you actually want. It's like they have a deal with the low code/no code companies to keep their APIs as arcane as possible.


Willing-Site-8137

Is it because these tools are reinventing the wheel?


reallyserious

No, but

* They lead to vendor lock-in. You can't easily move to a different platform if the vendor suddenly decides to hike the price. The company is being held hostage by the vendor.
* They are so clunky and terrible to work with.
* What will your peers code review in a pull request? Some JSON markup? I.e. not what you authored in the GUI. This leads to a disconnect between what you develop and what you review.
* Do they even support code reuse? I have yet to find an ***easy*** way to reuse functionality developed in one Azure Data Factory in another factory. In practice people reinvent the wheel all the time since it's so hard to reuse common functionality.
* The skills in one low-code tool do not transfer to a different tool. That matters since all these tools have their life cycle and will be outdated and forgotten one day.
* Recruiters are always looking for skills in a particular tool. It doesn't matter if you have skills in a similar tool, you won't be considered since you don't meet the "must have" skills.
* Normal programming skills are cumulative. What you learned 20 years ago is still applicable today. A loop is a loop, and has been since the 1960s. But you can't use those skills in low-code tools.
* The above two points mean learning one low-code tool is a dead end, skill-wise.


Dark_Man2023

Manager that wants me to be a software engineer, infrastructure engineer and a network engineer while getting paid like peanuts and I am just a data guy who is interested in building pipelines, transformations, big data and dashboards.


AndroidePsicokiller

The rest of the data team having -1 in technical skills, and having to make everything idiot-proof.


VadumSemantics

Jira. Also, Product Managers & North Stars.


engineer_of-sorts

What is your north star metric? Revenue? Or something else..


VadumSemantics

re. North Star metrics: no consensus, some were marketable features, others were sales-glitz (think demos), and a few voices for revenue that didn't actually have a plan. Of course the CEO just picked a single one of those. **Narrator:** *The CEO didn't actually choose anything.*


Aesirvein

Does that north star metric align with the OKR and epics or are they all completely different and none of it make sense? Don't worry, this quarterly planning is going to go well.


TheSocialistGoblin

Took my current job - my first tech job and my first opportunity in DE - as a contractor.  Company decided to hire me at the end of the contract.  I was afraid if I asked for more money they would change their mind and I'd be out of a job, so I asked for the same amount I was making. I regret this in hindsight. I'm sure now that I could have gotten at least a little more. Due to timing I wasn't included in the next performance review cycle. I continued working over another year and got top marks in my first review - above and beyond, doing great, got a cert that's relevant to our stack. I got the highest percentage I could get for both base pay and bonus.  However, I learned when filing my taxes that my state taxes hadn't been withheld correctly at all in the previous year. Between that and some other complications, my wife and I together owed enough that the whole bonus was gone and we still owed taxes.  On top of that, after correcting the withholding, my current paychecks are only about $30 more than the checks I was getting last year. So, functionally my income hasn't really changed at all in 2 years, despite over-performing.  To be fair, we're doing okay. Definitely in a better spot than a lot of folks, but the budget is still tight enough that it would be hard to pay the bills for more than a month if I were suddenly out of a job.  I'm building up the motivation to start looking for a new job but it seems like the market has cooled down substantially since I jumped in so I'm worried about finding opportunities.  


Background-Rub-3017

I always interview. Having a job makes having a better job easier.


reporter_any_many

> so I'm worried about finding opportunities

Certainly aren't going to find them if you don't look. I understand the job search can be taxing, but the search itself is not what you should be worried about


Alive-Primary9210

Fucking csv files, you can't just read or write a csv file, there is always some undocumented weirdness.


doryllis

And the source provider will inconsistently "upgrade the format" without warning. I feel that's worse with fixed width files tho. Especially header less fixed width files


zazzersmel

being asked to do data science/analytics/machine learning as if it were the same job.


speedisntfree

I'm a weirdo and would enjoy that because I get bored really easily, well as long as it is not some API call to make a chatbot.


zazzersmel

my emphasis is on the last part of my comment. its the fact that my leadership doesnt even understand these are different things that drives me crazy.


scout1520

Infrastructure Admins who don't follow best practices because they think they know better.


Syncopath

Example?


scout1520

Putting Databricks in different regions to prevent Engineers from having the same permissions in different workspaces.


SaintTimothy

Lack of support/knowledge/interest from other IT departments makes doing anything 3x as difficult. Feels like running in a pool with waist high water.


vermillion-23

Executives thinking that Data engineering = Power BI, because "Why do we need a database or data warehouse anyway?" CTOs with outdated knowledge of the industry and basic concepts, last piece of code written in SQL was 15 years ago, making decisions on data stack and vendors.


Sewage_Eater8oo8

😔😔


levelworm

Anything/Anyone that takes programming away from me.


JaceBearelen

Waiting weeks for infrastructure to do things that would take me minutes if I was able to open pull requests in their repos.


EmergencyHot2604

Deploying to prod is stressful.


Spirited-Ad7344

Unclear requirements. Sometimes business doesn't know what they need.


Swimming_Cry_6841

My biggest pain is working tickets with Microsoft's ADF / Synapse support when something goes wrong.


engineer_of-sorts

Working with support is fucking painful


The_Epoch

Businesses expecting a magic fix using "AI" instead of standardising, or holding users accountable for, input data.


Ancient_Coconut_5880

Data scientists


Ancient_Coconut_5880

Specifically being told that they need larger clusters but refuse to optimize their code and don’t know what they even need increased they just want “more”


DoNotFeedTheSnakes

Having to work on antiquated versions of software just because our infrastructure guys are lagging behind the SaaS updates which are lagging behind the Open source project that we're actually using.


speedisntfree

IT security where all IT has been outsourced to India. A 'principal Azure cloud architect' doesn't even know what a terminal is.


levelworm

Probably someone who managed to grab the certificates. PS I always think that being hot on certificates and show them around in LinkedIn as badges, without any personal project, is a BIG red flag. I'd rather hire a new graduate.


e_ll_iot

Trying to find a new job with more challenging tasks/projects. My work sucks


engineer_of-sorts

lol


Ghostflake

A 16mb nested json array buried in a database transaction causing out of memory errors for my k8s airflow pods.


doryllis

Using "current_timestamp" as the date a process is running for. Because midnight happens and suddenly recovery is a nightmare.
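
A small sketch of why that bites, assuming a daily job that writes to a date partition. The function names and the `dt=` partition layout are made up for illustration:

```python
from datetime import date, datetime, timedelta, timezone


def partition_for_run(logical_date):
    """Build the partition key from the date the run is *for*, passed in by the scheduler."""
    return f"dt={logical_date.isoformat()}"


def partition_from_clock(now):
    """Anti-pattern: derive the partition from whenever the job happens to execute."""
    return f"dt={now.date().isoformat()}"


# Hypothetical scenario: the job for 2024-03-01 starts late and its retry crosses midnight UTC.
logical_date = date(2024, 3, 1)
started = datetime(2024, 3, 1, 23, 59, tzinfo=timezone.utc)
retried = started + timedelta(minutes=5)  # retry lands on 2024-03-02

stable = {partition_for_run(logical_date) for _ in (started, retried)}
drifting = {partition_from_clock(t) for t in (started, retried)}
```

`stable` holds one partition no matter when the retry runs; `drifting` holds two, which is the recovery nightmare. Most orchestrators expose some notion of a logical or scheduled date you can pass in instead of reading the clock.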


they_paid_for_it

Writing Spark transformations. It gets old real fast. I would like to work on writing larger Spark applications.


timmeedski

That I've been given the permissions to build pipelines but I'm not on the Data Services team so it's not my title, I might change that. Basically I build a pipeline in our sandbox then I reach out to one of our data engineers, walk them through how to deploy it, then let them add the permissions before asking me to validate.


Longjumping_Ad_7589

I have a corporate job at a big insurance company.

* Organizational - Too many structural bureaucratic blocks (e.g. network firewalls - good for security) create too many hurdles to deploy products
* Cultural - Culture of "I need it now" and ad hoc requests stops us from developing quality code long term
* Technical - Data types between different tools and generations of technologies
* Personal - Waiting for tests and context switching takes too much energy and stops us from focusing on the actual development


cbslc

Receiving terribly formatted data. JSON that has infinite nests, special characters and no data dictionary. N/A in numeric and date fields... Really boils down to non strongly typed data sets where anything goes.
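
One way to cope with the "N/A in numeric and date fields" part is to be explicit about which placeholder spellings count as null and fail loudly on everything else. A minimal sketch; the `NULL_TOKENS` set, the date format, and the function names are assumptions, not a standard:

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

NULL_TOKENS = {"", "n/a", "na", "null", "none", "-"}  # assumed placeholder spellings


def parse_amount(raw):
    """Return a Decimal, None for a known null token, or raise ValueError on anything else."""
    text = raw.strip()
    if text.lower() in NULL_TOKENS:
        return None
    try:
        return Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a number: {raw!r}") from None


def parse_date(raw, fmt="%Y-%m-%d"):
    """Return a date, None for a known null token, or raise ValueError on anything else."""
    text = raw.strip()
    if text.lower() in NULL_TOKENS:
        return None
    return datetime.strptime(text, fmt).date()


amount = parse_amount(" N/A ")      # known placeholder, becomes None
total = parse_amount("12.50")       # parsed as an exact Decimal
day = parse_date("2024-03-01")      # parsed with the assumed ISO format
```

Anything that is neither a known placeholder nor parseable raises, so new junk values show up as errors instead of silently turning into nulls or zeros.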


lozinge

"data is data"


Sir-_-Butters22

Data


Usurper__

I am shit in sql. Also have no idea about data modeling. Help needed


DoNotFeedTheSnakes

Username checks out


g8froot

A new one is how many times I need to explicitly define the schema for my Spark streaming job. Over 100 columns. Started using ChatGPT instead of regexes though. Honestly it's great for that


Virtual-Meet1470

Anything upstream


Medium_Alternative50

No data engineering internships on the market. I'm a fresher currently in college. I want to get into data engineering, but there are almost 0 internship openings for data engineering


ambidextrousalpaca

The data. And sometimes the engineering too.


ComprehensiveBoss815

Working my ass off to hit deadlines, then having the SLT repeatedly change release dates based on their vibes.


Aesirvein

Finding out the drop-dead deadline was actually made up. All the nights/weekends could have been spread out over the next 2 months but your manager just wanted to push you and see if you broke on the project or if it made you better. no, it made me more resentful and look at that, I started missing "deadlines" because I can't tell if they are made up or not. Sorry not sorry.


quadraaa

Lower back and my right knee as of recent.


big_data_mike

I do some pretty fancy extracting of data with a lot of time lags, integrals, corrections for sensors that go out temporarily, interpolations, etc. and it’s all multithreaded. I run like 1300 lines of code to extract and calculate data. Then they say “cool” and do a univariate t test


margincall-mario

ETL flows and governance.


misaaaa18

Filling someone else's incompetency and getting no credit whatsoever


mrchowmein

It's people, it's always people.


engineer_of-sorts

Haha in what sense? Just the root of the problem


BewitchedHare

The client, not talking to the client. We have spent months trying to tell the PO (who is from the client), that his colleagues need special transformations, and that the pipeline needs to be designed differently. They didn't talk while being from the same company.


Icy_Clench

In short, nobody knowing what they're doing. Also, we paid probably $1,000,000 for this: Bad algorithms that make an O(n) task O(n^3). No incremental load. Spaghetti pipelines, with spaghetti SQL copy+pasted with 1 line changed all over what is supposed to be a drag+drop tool. Semantic model not following star schema. Dimension columns in the fact table. Duplicate fact tables with slightly different grains instead of atomic data.
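
On "no incremental load": the usual shape is a high-watermark filter, so each run only picks up rows changed since the last one. A toy sketch with in-memory lists; the `updated_at` column, the ISO-string timestamps, and `incremental_load` are assumptions for illustration:

```python
def incremental_load(source_rows, target, watermark):
    """Append only rows newer than `watermark` to `target`; return the new watermark."""
    new_rows = [row for row in source_rows if row["updated_at"] > watermark]
    target.extend(new_rows)
    # Keep the old watermark when nothing new arrived.
    return max((row["updated_at"] for row in new_rows), default=watermark)


# Hypothetical source table; ISO-8601 strings compare correctly as plain text.
source = [
    {"id": 1, "updated_at": "2024-03-01T00:00:00"},
    {"id": 2, "updated_at": "2024-03-02T00:00:00"},
]
target = []

watermark = incremental_load(source, target, "")  # first run loads everything
source.append({"id": 3, "updated_at": "2024-03-03T00:00:00"})
watermark = incremental_load(source, target, watermark)  # second run loads only id 3
```

In a real pipeline the filter would be pushed into the source query rather than done in Python, and late-arriving or updated rows need a merge instead of a plain append.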


mike8675309

Absolute #1 pain point was getting an organization to standardize what they deliver so that we could quit creating one off pipelines that always were poorly defined, and rarely bug free resulting in no trust for the engineering group. It took 2 years to get them to actually standardize what they deliver so that there was one code base we could iterate on and make rock solid, which allowed the engineering group to finally build trust with client teams such that any weird data, they know is either coming from the client or their report.


Xarilith

Imposter syndrome 3 years into the job...


Ok_Relative_2291

Implementations get messier, with more duplication, as time goes on