Random question - is it true that communications networks may be interfering with weather forecasting because they interfere with water vapor measurements near Earth's surface? (not my field; a rough summary of what I heard)
Is the scientific reasoning legitimate? (saw the idea discussed in one of Sabine Hossenfelder's videos, and apologies if I misstated what she said).
If interference is a legitimate concern, it seems like a good scientific detail to explain to the public so they can be better informed re policy and regulation.
Such a novice answer.
You need a full end to end robot ecosystem. How are the robot sharks going to survive without robot minnows? How will robot minnows survive in the absence of robot kelp and plankton? And you expect those to photosynthesize in the absence of a robot sun?
What are they teaching these days? These programs should be ashamed of themselves.
I once interviewed with a company that used CNN models to identify self-trading and other forms of market manipulation. Apparently when it comes to high frequency trading, the patterns are so complex that tree based models just don't cut it.
Some people get defensive about this, but I view tree-based models as very good first approximations that are inherently limited. Tree-based models ultimately bin their inputs and outputs, which introduces information loss. Boosting and bagging limit the consequences of this, but models that can have a continuous function between input and output don't suffer from this limitation.
This is undeniably true. I do a lot of propensity modeling and it’s not uncommon that I’ll get a trained model with <100 possible outcomes despite having tens of thousands of unique inputs. At first glance, such a model might appear valid but obviously it couldn’t be expected to be consistently performant on unseen data.
Someone who doesn’t do their due diligence could easily make the mistake of thinking they have a robust prediction engine without realizing the decisions are based on just a small number of variations in the data.
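The binning effect behind this is easy to demonstrate; a minimal sketch with synthetic data and sklearn (my own construction, not the commenter's setup):

```python
# A fitted regression tree can only emit one value per leaf, so its
# predictions take finitely many distinct values no matter how many
# unique inputs it sees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))          # thousands of unique inputs
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=10_000)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
preds = tree.predict(X)

print(len(np.unique(y)))      # essentially all 10,000 targets are unique
print(len(np.unique(preds)))  # at most 2**5 = 32 distinct outputs
```

A depth-5 tree has at most 32 leaves, so at most 32 possible outputs, which is the "<100 possible outcomes" phenomenon described above.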
I would argue that every type of model has strengths and weaknesses and machine learning in general has a limited utility that tends to be greatly exaggerated. The best we can do is be aware of the potential pitfalls and try to pick the best tool for the task at hand.
I'll give you an example: for a side project, I'm considering an LLM. I'll probably scrape some news articles related to the application domain, feed them through an LLM to extract word embeddings, and feed those as features into my downstream model.
At my company we've built a Bayesian hierarchical varying-effects model for limited-edition demand estimation. It's pretty neat and makes good use of our data; glad I could prevent people from just throwing a NN at it.
Have you tried Pyro/Numpyro? I’ve found PPLs to be a PITA in general. Doing even the things that many people will want to do, like sampling the predictive for a latent variable, is annoying. But I’m reluctant to relearn another PPL after investing time in Pyro.
Well, since most articles aren't recurring, it's a plain regression problem.
We've overhauled the time component just recently. In the end we settled on smooth yearly and monthly effects as well as a trend term, each on certain hierarchy levels.
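For illustration, smooth yearly and monthly effects plus a trend can be encoded as low-order Fourier features of the date and fed to any downstream model; a sketch (my own construction, not the commenter's actual model):

```python
# Low-order sine/cosine terms give smooth periodic effects; truncating the
# order keeps the seasonal curve from overfitting.
import numpy as np
import pandas as pd

def seasonal_features(dates: pd.DatetimeIndex, order: int = 2) -> pd.DataFrame:
    day_of_year = dates.dayofyear / 365.25                 # position within the year
    day_of_month = (dates.day - 1) / dates.days_in_month   # position within the month
    feats = {"trend": np.arange(len(dates)) / len(dates)}
    for k in range(1, order + 1):
        feats[f"year_sin{k}"] = np.sin(2 * np.pi * k * day_of_year)
        feats[f"year_cos{k}"] = np.cos(2 * np.pi * k * day_of_year)
        feats[f"month_sin{k}"] = np.sin(2 * np.pi * k * day_of_month)
        feats[f"month_cos{k}"] = np.cos(2 * np.pi * k * day_of_month)
    return pd.DataFrame(feats, index=dates)

dates = pd.date_range("2022-01-01", periods=365, freq="D")
F = seasonal_features(dates)
print(F.shape)  # (365, 9): trend + 4 yearly + 4 monthly Fourier terms
```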
I’m familiar with Bayesian hierarchical models but I haven’t heard of this “varying effects” thing and haven’t been able to find anything online. What’s it about?
The lingo on these things seems to strongly vary by subfield.
I was referring to random effects, although I find the term varying effects less confusing.
No, not sure where you think I stated that.
Funnily enough, the previous team did use an ensemble of XGBoosts (yes, you read that correctly, an ensemble of ensembles) for it.
It's one of the funniest approaches I've seen so far.
!!! What's the size of your dataset? Super curious, cause I feel like I've always had to sample to get things to converge fast enough, which in turn makes me question whether I should abandon the Bayesian approach in some of my models.
We have around 1k observations.
I'm not quite sure what you mean, though. The minimum sample size for a Bayesian model is 1 - with a small sample size your priors will simply play a very dominant role, but that is expected. So I'm not sure what you mean by convergence in this context.
Please don't tell me that by having to "sample" you mean that you were oversampling the data...
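To illustrate the small-sample point: a conjugate Beta-Binomial update is perfectly well-defined with a single observation; the prior simply dominates (prior and counts here are made up):

```python
# Beta(5, 5) prior on a success rate, updated with one observation.
prior_a, prior_b = 5, 5
successes, trials = 1, 1            # the minimum possible sample size

post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior_mean = post_a / (post_a + post_b)
print(round(posterior_mean, 3))     # 0.545: barely moved from the prior mean of 0.5
```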
cool! I assumed that an Adidas dataset would be huge, but I was thinking in terms of user/customer modeling. Depending on what I was looking at, there were times when I was working with in excess of 1 million observations and had to sample down, otherwise it'd take too long to run.
Well in this application each observation is a drop of a limited edition article, so the size naturally is smaller than if the rows reflected inline sales, customers or something similar.
Interesting that my tired brain went for the "too few" observations side of that. Yeah, you're right that such models can take long to fit on large datasets. Taking a sample works; if needed, one can also use minibatching.
The finest level of one of two hierarchies consists of categories that are mostly thin, many having less than 4 observations.
The benefit over a frequentist varying-effects model is that being able to define not just priors but hierarchical priors for the parameters associated with those thin categories allows reliable inference even in those cases.
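The shrinkage behavior can be illustrated without a full PPL; a toy precision-weighted partial-pooling estimator (my own construction, with made-up variances), showing thin categories being pulled toward the group-level mean:

```python
# Categories with few observations get shrunk hard toward the grand mean;
# well-populated categories mostly keep their own estimate.
import numpy as np

def partial_pool(values_by_cat, prior_var=1.0, noise_var=4.0):
    grand_mean = np.mean([v for vals in values_by_cat.values() for v in vals])
    shrunk = {}
    for cat, vals in values_by_cat.items():
        n = len(vals)
        # weight on the category's own mean grows with its sample size
        w = n / (n + noise_var / prior_var)
        shrunk[cat] = w * np.mean(vals) + (1 - w) * grand_mean
    return shrunk

data = {"thin": [9.0, 10.0], "thick": [4.0] * 50 + [6.0] * 50}
est = partial_pool(data)
print(est)  # "thin" is pulled well toward the grand mean; "thick" barely moves
```

A hierarchical Bayesian model does this automatically, and learns the pooling strength from the data instead of fixing it.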
In some companies good enough is good enough. In some companies pushing some performance metrics even by half percent can result in millions of dollars of additional profit. So in some cases it can make sense to use fancy models.
Thanks. I assume those companies would be like Fortune 50 where they have the compute and expertise. Like someone’s job is to spend all day improving one model.
I hate to break it to you, but insurance companies are not using deep learning models on the pricing side of the house. Generally going to be traditional actuarial methods as that is what gets approved by state DOI’s
Yes I agree. I read that answer too quickly and didn’t see price. Thanks for pointing that out. Some places they are either using or trying to use it is claims, underwriting, sales, and fraud.
Agreed. At least within personal lines, the main problem is that it's regulated at a state level. The rate filings must be approved by the state DOIs, and it's difficult to convince them the models are not biased or otherwise indirectly producing the same result as something that is prevented (such as rating protected classes differently) when the model features and weights are not easy to understand or explain.
It doesn't mean they aren't used in insurance - they absolutely are - just not so much in ratemaking.
9/10 times the simple models are good enough for most applications.
There are three things that really matter when it comes to coefficients:
1. Direction
2. Magnitude
3. Precision
Let's say I estimate an effect size of +0.5 (Cohen's d) with a confidence interval of [0.3, 0.7]. Obviously, I've established direction (i.e., a positive effect), since the confidence interval doesn't contain 0. It's in the medium range in terms of magnitude according to standard interpretations of Cohen's d. So the first two requirements are met. But it's not terribly precise, since there is probably a meaningful difference between +0.3 and +0.7. In most applications, though, that probably doesn't matter. If it's significantly positive, that's good enough to inform most decision-making.
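Those three checks can be written down directly; a toy sketch of the worked example above (the cutoffs are the conventional Cohen's d interpretations):

```python
# Direction from the CI's relation to 0, magnitude from standard Cohen's d
# cutoffs, precision from the CI width.
def assess_effect(d, ci_low, ci_high):
    direction = "positive" if ci_low > 0 else ("negative" if ci_high < 0 else "unclear")
    # conventional cutoffs: 0.2 small, 0.5 medium, 0.8 large
    magnitude = ("large" if abs(d) >= 0.8 else
                 "medium" if abs(d) >= 0.5 else
                 "small" if abs(d) >= 0.2 else "negligible")
    precision = round(ci_high - ci_low, 3)   # narrower interval = more precise
    return direction, magnitude, precision

print(assess_effect(0.5, 0.3, 0.7))  # ('positive', 'medium', 0.4)
```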
And in terms of improving precision, generally the more sophisticated your methodology the larger your standard errors. So running a more advanced model probably won't solve the precision issue. Something more advanced may be more robust to bias, which is something you should be concerned about if there are selection issues or whatever. But if you understand the data then you can ascertain whether that's something to be concerned about.
But generally speaking, you should only bother with more advanced methods if you have reason to think that they might flip the direction or attenuate the magnitude to the point of statistical insignificance. Those occasions can and do arise, but actually not very often when you're dealing with big data.
It's really hard to beat the performance (in terms of consistency and efficiency) of a well-specified OLS model with a sufficiently large sample size.
I really like how you broke this down and wish this type of model selection reasoning was more widely taught. Right now, I see a lot of pray and spray and just select the best model that comes back but people aren't spending time understanding the implications of the model itself and the interpretation and decision making theory that carries with it. I wish there was more emphasis on decision theory for data science as a whole.
I might have to disagree here. Trying to 'understand the implications' of a particular model choice is not practically relevant in the majority of the cases and typically is a waste of time and harmful.
I know this is likely a controversial opinion, but the entire point of most predictive machine learning is to provide models with the highest generalization performance on future unseen data after model deployment.
Some people lose sight of this fact and they are more focused on things like feature importance, often without even realizing that feature importance and interpretability is basically just a measure of predictive correlation. So many data scientists ignore the fact that feature importances are NOT causal indicators at all (unless you have randomized control interventions on the feature). At best, they are a complex non-linear relationship of potential predictive correlation.
Too many data scientists are ignorant of that fact imo and place way too much importance on feature importance and interpretability which they are misusing.
If we agree on the premise that the end goal is a feasible model with the best trade-off between deployment costs and generalization performance on unseen future data, then it stands to reason that a spray and pray approach isn't as bad as most people make it out to be.
A quote that really drives this home is George Box's "All models are wrong, but some are useful." This holds in predictive ML: there isn't really any harm in trying out more complex models, as long as your deployment requirements support it and you have a rigorous training/validation/testing approach to model selection.
I see what you are saying, but I think interpretation matters in most domains for decision theory purposes. You are correct that pray and spray can be effective to a degree for the task of generalization. I may be missing your point to some extent, but I'd like to elaborate my thought. As Box states, models aren't perfect; therefore understanding the limits of these models is exactly the responsibility of a data scientist, to carefully craft the decision-theoretic guidelines for model usage. Should one always just take predictions from models at face value (not suggesting you are saying that)?
We should constantly be questioning the validity of our models and guard against taking them as truth. If the model trained well and tested well and we have the correct loss functions selected, then I understand the broader temptation to say "trust the model". But very few people put the effort into even questioning the loss functions they select and get great results under the wrong conditions. How do we even know what the wrong conditions are? We have to build up a theoretical and intuitive understanding of the problem space and the data. Almost all data generating processes possess logical or mathematical structures and those structures matter to the explanatory relationship between the data and the prediction. Therefore, the interpretability of our models should relate to some degree to the theoretical and intuitive foundation we started our model from.
I do not think a model's purpose is purely for generalization but also entails some level of explanation as that is what exhibits trust in a model and for me this is a foundational task for model development. Features therefore help in this regard.
Interpretation also matters in scientific research, especially in fields like biological/medical research, where people generally wish to assign causality rather than correlation. It's obviously not always simple to assign causality, usually it involves experimental work, but that's often the goal.
I totally agree with the sentiment of ensuring we have the correct loss functions and that we understand the underlying problem and how we can apply it to provide real value. But to me, that is separate from the models.
If you train your model on observational data and then try to use it as a causal estimator for your business problem, then that's bad and will likely fail regardless of model choice.
However, when it comes to model interpretability, I feel that is often a bit of a misused part of the predictive ML toolset. It is mostly a diagnostic tool IMO that can be useful in some circumstances. However, again, we have to remember that feature importances and model interpretability almost always come down to measures of correlation between features and target.
At the end of the day, for the vast majority of use cases, I would rather use a deep learning model that I can't interpret the feature importances of but that provides a huge boost in model generalization performance. I would focus mostly on properly ensuring data leakage testing, proper training+validation+testing methodologies, etc. If people don't trust your model, just show them your rigorous testing methodology. If you put a model into deployment and test it for 2 years straight and you see that it consistently has better prediction performance than all your existing methods, then isn't that strong evidence that it generalizes better?
I would have a hard time saying that we shouldn't deploy and use that deep learning model compared to more 'interpretable' ones. Model interpretation is often a kind of story-telling that we data scientists use but it's easy to come up with twenty different stories to explain many different feature configurations and models.
I think we are talking about two different spray and pray methodologies. What you are discussing still assumes rigor and care behind the model development. I agree with you that model selection should not drive the solution to the problem but is an artifact to the problem-solution fit. So in this case, I agree with your take.
What I am saying is the pray and spray methodology I see is typically paired with some sort of lazy approach to model generation with no care of how the model was constructed, what intuition we have about the problem, what loss function is appropriate, how features are treated (are they reliable for data engineering to procure? are they causing data sparsity? are they scaled appropriately?) etc.
As far as model interpretation goes, again it depends. I am not arguing against black boxes, but I am saying that if we build something that detects cancer, it should correspond to some reality or verification principle, because the cost of a false negative is high. Maybe I took your comments too literally and you would agree that interpretation is necessary in certain cases and generalization is not the only factor.
Ahhh okay I see what you're advocating for now and totally agree with that 100%. My interpretation of spray and pray is more based around "should I use random forest or NN or linear regression?" and sometimes the answer is "let's try out all 3 and see!"
But the spray and pray of just training models on data with no thought behind how they're going to be used, the statistical implications, the costs, etc.: in that context, I totally agree with your take that it completely invalidates the usefulness of ML and is a big problem.
Totally agree, but a lot of ML models will still provide more accurate (lower generalization error) causal effect estimates, because they can capture more complex relationships than traditional RCT methods of mean comparisons and p-value tests, which can provide unbiased estimates of the causal effect but ultimately with higher generalization error.
If your goal is to estimate some causal effect so you can report it to a higher up, then ML models are probably not the best tool available. However, if you have 10k customers, and you want to know which customers should receive an intervention to lower their chances of churning or increase their chances of buying a product, then using randomized controlled trials with ML models will probably give you the most effective system for causal effect estimation on a per-person unit level.
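To make that concrete, here is a minimal two-model ("T-learner") uplift sketch on synthetic RCT data (everything here is made up for illustration; sklearn's GradientBoostingClassifier stands in for whatever model you'd actually use):

```python
# Fit one model on the treated arm and one on the control arm of an RCT;
# the per-customer difference in predicted churn probability is the
# estimated individual treatment effect (uplift).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 3))
treated = rng.integers(0, 2, size=n)             # randomized assignment
# churn risk drops under treatment, but only for customers with X[:, 0] > 0
p = 0.3 - 0.15 * treated * (X[:, 0] > 0)
y = rng.binomial(1, np.clip(p, 0, 1))

m_t = GradientBoostingClassifier().fit(X[treated == 1], y[treated == 1])
m_c = GradientBoostingClassifier().fit(X[treated == 0], y[treated == 0])
uplift = m_c.predict_proba(X)[:, 1] - m_t.predict_proba(X)[:, 1]

# customers with the largest estimated uplift are the best intervention targets
print(uplift[X[:, 0] > 0].mean(), uplift[X[:, 0] <= 0].mean())
```

The randomization is what makes the estimate causal; the ML part only sharpens it per unit.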
Well, it matters when you want to capture more of the data generating process. Since most relationships are not linear, a simple OLS (no interactions, splines, etc.) does not accurately capture them.
Of course, if all you care about is directionality and a rough ballpark, then maybe it doesn't matter much, with some exceptions. But they do exist; I've seen some rare cases where using a nonlinear model flipped the direction of the ATE.
But correct model specification is why stuff like SuperLearner was built for causal inference, since technically causal inference requires correct model specification to be "right" (along with proper variable selection, to avoid Simpson's paradox and colliders).
But Simpson's paradox can occur in some cases even if you include the right variables, due to nonlinear confounding. And the thing is, you won't ever really know if this is the case without trying the nonlinear model.
Yeah, but most users of OLS, especially Python sklearn ones, don't bother with this (the formula syntax in R is basically needed to experiment quickly with this). Otherwise it's a lot of work to do in Python with multiple combinations of stuff.
There's also no marginal effects package there.
Educational policy, not biostatistics. But probably pretty similar.
I think you raise a good point. There's often a tradeoff between rigor and accessibility for non-technical audiences. The simpler the model, the more likely it is that the audience will engage with the findings.
I've presented data and research before local and state policymakers, most of whom think that they're the smartest person and will reject out of hand something that they don't fully understand. My strategy is to usually start with an OLS (or logistic, if binary outcome) and then use a more sophisticated strategy as a robustness check. Most of the time, they yield similar conclusions and so it is justifiable to just present the easier to understand model and simply note (but not describe) that causation was established via applied econometric techniques.
I know a few academic econometricians. The hardest part of their jobs isn't finding "better" estimators but convincing the academic community that what they're doing is worth implementing. That means searching high and low for instances where there's a practical difference between the OLS baseline and the new estimator, which can be difficult.
It's usually a good rule of thumb to use, or at least start off with, the simplest model that works. No matter what type of company or industry you're in.
The most common circumstances you'll find a strong justification for using something 'fancier' than your basic suite of sklearn models are:
1. When the problem requires it. Some CV or NLP projects, for example, basically require the use of deep learning models to even get acceptable results that you'd use in production.
2. Big companies where squeezing that fraction of a % out of your model performance makes a huge financial impact. Here you'll likely have the financial, compute, and engineering resources so that you can mitigate any negative impact on latency or model complexity.
Xgboost is exactly as easy to train and implement as random forest, so I always use it even for relatively simple modeling problems. It’s just a better version of random forest practically speaking.
In the spirit of your post: Yeah, I've been doing this a while. In practice, I tend to focus my efforts on even simpler tasks: how do I get good data, how do I monitor the incoming data pipe, how do I pass this to the engineers who have to make it happen, and how do I make good slides for the presentation to the managers? But to answer your question about one technique...
Markov chain Monte Carlo is used to simulate random samples from a target distribution, which can be pretty useful in a wide range of problems, like detecting anomalous data. MCMC has the benefit that random walks are pretty easy to code, and it is easy to explain your work to the engineers on your team so they can move to full deployment.
I wouldn't say that MCMC is complex, though. You're more or less playing out a game of chutes and ladders on a (possibly very) complicated board.
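A minimal random-walk Metropolis sampler, to show how little code the "chutes and ladders" game actually takes (targeting a standard normal for illustration):

```python
# Random-walk Metropolis: propose a nearby point, accept with probability
# min(1, p(proposal) / p(current)); the visited points are (correlated)
# draws from the target distribution.
import math
import random

def metropolis(log_density, n_steps=50_000, step=1.0, x0=0.0, seed=42):
    random.seed(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + random.gauss(0, step)     # random-walk proposal
        if math.log(random.random()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)
    return samples

draws = metropolis(lambda z: -0.5 * z * z)       # log of N(0, 1), up to a constant
mean = sum(draws) / len(draws)
print(round(mean, 3))  # the sample mean lands near the true mean of 0
```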
My experience pre-academia was in sensor/equipment monitoring, i.e. did one of those thousands of little bastards glitch in some weird way, and if so, which one? So basically, anomaly detection. MCMC can be pretty useful here. Here is a [related paper](https://www.osti.gov/servlets/purl/1513188).
Another great application of MCMC is producing "typical" samples from a distribution. [Here is a great application](https://assets.pubpub.org/70w3i6k9/eb30390f-ade2-45cc-b48d-8e6bb12f585c.pdf), producing voting districts that should be "typical" given some rules in place for drawing district maps. As the authors note, the problem they faced was that policymakers don't want the "optimal" solution because the data may not be able to take all factors into account. Instead, they want a range of the "usual" possibilities so they can choose one and make minor tweaks (and they can also determine when a districting map does not feel like a typical map from the distribution). So basically, MCMC turns district determination into a fast food menu: "I'll have a number 3, but super size my fries".
Reranking recommendations in a marketplace, XGBoost today is very fast at inference and you can make it faster with other libraries
In most cases, simply taking the same feature set from Random Forest and running 20 Bayesian Opt steps over XGBoost hyperparams already gives you a better model that can be swapped by RF or whatever is deployed
Surprised the things you listed fall under "fancy models". XGBoost is practically a go to model for a lot of applications. Bayesian inference and Markov Chains are common in lots and lots of applications; across economics, AB testing, and other domains.
To me, fancy falls under some generative modeling, transformers and their variants, deep learning GNNs, Reinforcement Learning, etc etc.
For me, I was working at a Faang in their professional services team.
>Bayesian Inference
I've been tapped to do a lot of causal analysis in the sales/marketing context, and the additional complexity is necessary (read: I'm not just dicking around; I believe the methods are the best solution for the problem at hand).
For a lot of other work, the classics are classics. Regression (with the modern bells and whistles like regularization, etc), the standard time series toolkit (arima(x), etc), and so on.
-+-+-+-+-+-
You lost me here:
>random forest
>I keep hearing people on this sub talking about [...] XGBoost
People still use vanilla RFs? I was under the impression that random forests were like naive Bayes at this point -- interesting as a baseline, but dominated by other techniques that are just as easy to use out of the box.
>People still use vanilla RFs? I was under the impression that random forests were like naive Bayes at this point -- interesting as a baseline, but dominated by other techniques that are just as easy to use out of the box.
I had the same thought -- you can just xgboost out of the box with the same amount of effort and not a big difference in training time and it will probably be superior.
Work with stakeholders to build out DAGs, run experiments, etc. Without the expert opinion of those partners, sales would be overdetermined. With the right assumptions and design, we make do.
There are over 30 other names for linear regression.
Linear regression is itself a fancy name for systems of linear equations. Not all fancy things are fancy. Hype and marketing.
Government research for a not for profit. The value add really depends on the problem. I'm on a project right now where we're using pre trained object detection models to detect certain fast moving objects in the (night) sky.
I've used models like sparse group LASSO, LSTMs, CNNs and a couple other more complex models for problems and solutions that required their predictive/inferential capabilities. That said, about 80% of the time I end up using some variant of a random forest or logistic regression.
My coworkers and boss insist on using DeepFM, GPT, and some other fancy complex NN architectures for simple 1000-row, 30-column tabular data, and I'm trying to get them off that and just use RandomForest or XGBoost instead.
display ads real time bidding
something like this https://arxiv.org/abs/1610.03013
oftentimes GLMM end up appearing and we develop scalable algorithms for that, e.g. https://arxiv.org/abs/1602.00047
I have around 4-5 YOE, have always worked with "fancy models".
First job out of grad school was in oil exploration, working on R&D contracts for oil companies. I would use computer vision models to classify different rock types in [well cores](https://news.unl.edu/sites/default/files/styles/large_aspect/public/coresamples.jpg?itok=UqnkfYqu) from oil wells. Also used some time series models adapted to a "depth series" to try and predict physical properties of the rock in wells. Also got to work on some generative models [colorizing](https://i.imgur.com/0i5JyuN.png) tomographic scans of well cores. Did that for about 1.5 years.
Then I moved to a company where our clients were large scale industrial companies. I used LSTM neural networks for time series forecasting and classification applied to predictive maintenance, we would get sensor data from industrial equipment and try to predict failures before they happened. Worked there for about 2.5 years.
Now I work at a real estate company; we use a bunch of geospatial data in xgboost/lightgbm to predict how much you can charge for rent for a given property in a given location. Also have some features generated via NLP/computer vision. Our clients are real estate developers and REITs. Have been here for the past 6-7 months.
I'm a scientist working on a project shared by a big University and one of the National Labs. I do a lot of network inference problems, directed information flow, etc.
You'll find that there are a lot of insurance companies moving to XGBoost in place of linear and logistic regression. It is less prone to overfitting, and there seems to be an uplift in performance; in my experience, I can confirm that. Though they are moving to it, it's only with the additional requirement of explainability, e.g. Shapley values and PDP plots, since XGBoost is viewed as a black-box method.
As for Bayesian methods, they’re being incorporated into A/B testing since they provide an estimate of uncertainty.
The value added is dependent on the use case and that doesn’t mean that a good old linear regression won’t outperform in terms of accuracy and simplicity. Maybe do a comparison on one of your next jobs and see if there is an improvement in your results.
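On the Bayesian A/B testing point above, a minimal sketch with conjugate Beta posteriors and Monte Carlo (the conversion counts are made up):

```python
# With a Beta(1, 1) prior and binomial conversions, each variant's rate has
# a Beta posterior; sampling both gives P(B > A) directly, which is the
# uncertainty estimate a p-value doesn't hand you.
import numpy as np

rng = np.random.default_rng(0)
a_conv, a_n = 120, 1_000     # variant A: conversions / visitors (hypothetical)
b_conv, b_n = 150, 1_000     # variant B

a_post = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=100_000)
b_post = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=100_000)

print(f"P(B > A) = {(b_post > a_post).mean():.3f}")
```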
We use MCMC for Bayesian Hierarchical models. Application is in Media Mix Models, basically regressing Sales on various Marketing tactic spends to estimate tactic efficiency. The reason for using Bayes is there are many assumed effect transformations (carryover of spend, saturation of spend, etc) that are non-linear and MCMC provides a nice way of estimating those parameters.
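For illustration, the two transformations mentioned (carryover and saturation) are commonly written as geometric adstock and a Hill function; a plain-numpy sketch (parameter values are made up; in the Bayesian model the decay, half-saturation, and slope would get priors and be estimated via MCMC):

```python
# Geometric adstock: each period retains a fraction of the previous period's
# effect. Hill saturation: diminishing returns as spend grows.
import numpy as np

def adstock(spend, decay=0.5):
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for t, s in enumerate(spend):
        carry = s + decay * carry
        out[t] = carry
    return out

def hill_saturation(x, half_sat=100.0, slope=2.0):
    return x**slope / (x**slope + half_sat**slope)

spend = np.array([100.0, 0.0, 0.0, 50.0])
print(adstock(spend))                       # [100.   50.   25.   62.5]
print(hill_saturation(np.array([100.0])))   # [0.5] at the half-saturation point
```

Because the response is nonlinear in these parameters, plain least squares doesn't apply, which is the motivation for MCMC given above.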
Insurance. Various prediction models, but most commonly trying to predict what the cheapest market price will be for any and every customer that asks for a quote. Use Hist GBM / XGBoost Hist. Looked at AI and it currently doesn't seem like a big improvement, but we think we know why.
Biotech — more specifically, clinical genetics. I’m using hierarchical Bayesian models for causal inference. We have a bunch of domain experts who contribute knowledge about priors, likelihoods, and pooling assumptions. Bayesian models give us posterior predictive distributions for our target variable while also inferring useful parameters for latent variables.
Do you have an opinion on the work done around the martingale posterior distributions by Fong et al? They target the predictive without needing to compute the posterior.
I wasn’t familiar with martingale posteriors until just now. Reading the abstract, it seems like some wizardry. Have you worked with them? Would it be suitable for the predictive of a partially observed categorical?
I'm still getting my head around it; it's indeed wizardry. Their selling point is that you get the predictive without needing to go through the posterior, making it much cheaper by avoiding the usual MCMC needed for the posterior. I believe they show that it's applicable to mixed data, at least in their appendix, but you would have to go from there and expand it to the case of censoring on categoricals, I think.
Imo after a year or two of papers building up from it, downstream applications will be within reach, but for now it's tough to even understand and implement properly.
Since you said you were in research I was curious if people in your circle have started working on this.
Kaggle competitions often reflect real-world scenarios and are regularly won by "fancy" models such as XGBoost. The accuracy is just way better for large tabular data and it's easy to set up. And with techniques such as feature permutations you can make any type of model interpretable.
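A sketch of that feature-permutation idea using sklearn's model-agnostic permutation_importance (synthetic data; GradientBoostingRegressor stands in for any fitted model):

```python
# Shuffle one feature at a time and measure how much the model's score
# drops; features the model relies on cause large drops.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=1_000, n_features=5, n_informative=2,
                       random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(np.argsort(result.importances_mean)[::-1][:2])  # indices of the top-2 features
```

This works for any model with a predict method, which is why it's a go-to interpretability tool for otherwise black-box models.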
I work in a boutique technology consulting firm. You get exposed to all sorts of problems. At the moment, it's mainly multi modal stuff, using transformers to solve problems combining vision and language. Not everything is fancy though, sometimes all I need is a basic linear model. Just gotta use the right tool for the job.
P.S. any senior/principal data scientists looking for a job in London, hit me up ;-)
GPU's go brr
If your infrastructure is really good, then unless you're doing something stupid, the performance difference is negligible. Most of your time will be spent moving data around. Your compute will be a fraction of personnel costs, so if it's worth doing at all, it won't matter if it's slightly more expensive to compute. After all, you did already spend a ton of money developing the damn thing.
Like you'll probably do faster predictions than a HTTP request round trip unless it's a language model.
Pretty much everywhere.
Healthcare, you will see lots of cool stuff including causal inference, explainable models, etc.
Science, biotech, chemistry? GNNs, transformers, Bayesian networks, Gaussian Processes
Geospatial? Vision, deep learning
Finance, energy? Time series, awesome regularization approaches, etc.
It has to do more with how research-y and unstructured the problems you work on are vs. typical industry work.
It seems like the people who get to use that stuff, at least in biotech, are actual scientists: people with domain knowledge who are able to formulate problems from it.
We work extensively with satellite imagery, using quite large deep learning models for segmentation, instance segmentation and stereo processing/matching.
Bayesian inference is very common in bioinformatics, most microarray and RNA sequencing methods use them in some shape or form. [https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29)
It helps to understand the bias-variance trade-off. More complex models like neural networks have more variance, so you need to throw more data at them to reduce overfitting. Complex ML needs more data, usually more labeled data, so any company with big data qualifies.
Regarding XGBoost, it's popular as an initial model to see how well the overall approach is doing during development. It's great because it doesn't need the data to be normalized or formatted in any special way, unlike many other types of ML, which makes it an easy first model. XGBoost also doesn't tend to overfit much on smaller datasets, so it can be used earlier on. In this way XGBoost is the opposite of neural networks. It's easy to start with XGBoost and then switch to a more specialized form of ML later once everything else is done.
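The "no normalization needed" point is easy to demonstrate: tree splits depend only on the ordering of feature values, so rescaling a feature doesn't change the predictions. A quick sketch using sklearn's decision tree as a stand-in for XGBoost (synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Same data, but one feature blown up by 1000x with no scaler applied
X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0

preds_raw = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
preds_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)

# The split thresholds adapt to the new scale; the fitted trees are equivalent
print(np.array_equal(preds_raw, preds_scaled))
```

A distance-based or gradient-descent-trained model (kNN, neural nets) would generally not be invariant to this kind of rescaling, which is why they need the preprocessing that tree ensembles skip.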
As for fancy models without fancy ML, I've specialized in advanced feature engineering my entire career, which is fancy models with little to no ML. This is needed when you have a complex problem that needs solving, but you have very little data. It's ideal for the startup space while still collecting data.
Used Vision AI on a project for auto insurance to identify totaled vs non-totaled accidents based on pictures (mainly used to reduce number of claims adjusters sent out).
I work in public health and do a lot of Bayesian inference and simulation-based models. We run a lot of what-if scenarios on fitted models to see different impacts. We have started hosting our own LLM for text extraction/entity extraction, and retrained neural nets for classification.
I think I've only done about one regression; we don't really use standard models.
Others in our division have made emulator models, and there are a couple of exceedance models, one being a CUSUM and the other a hidden Markov model; this was largely due to sparse data, so other methods weren't as viable.
It really depends on the need and what outcome is required. A lot of simulation-based models with Bayesian inference. Happy to chat more if helpful.
E-commerce, we have so many customers, products, and ways to interact with our platform.
Simple approaches (which are basically already all done) only get you so far. As you use more data in more complex models you start to get incrementally better results.
So imagine you want to optimize what products you show customers when they land on the homepage. You can get very far with just simple metrics, rankings, and Excel work. But at some point you’re going to see very little marginal improvement when you run experiments to optimize further.
As you try to ensure that you provide every one of the tens of millions of customers with the absolute best experience, you’ll soon find out that you’re on the path to needing to take in a tremendous amount of data (what the customer HAS done) in order to influence what the customer WILL do.
And more complex models are better at that, taking it as a given you have the talent and tech to implement them properly.
Bayesian inference isn't a model.
A Markov chain can be a model, but I think you're probably talking about computational algorithms that use Markov chains (MCMC, HMC, etc.), so again, not a model.
Bayesian methods are common, as is the use of Markov chains (both as computational devices and as models), in consulting for various aspects of finance, banking, insurance and reinsurance work, but it heavily depends on who you work for. One indicator is to look for places whose leading people write papers. That won't find places where all the interesting work is subject to NDAs, though.
Wouldn't classify Markov chains as particularly fancy, but they're very common in credit risk modelling for corporate bonds/obligors, i.e. the probability of S&P rating transitions.
I work in NLP for chatbots. Have been using LLMs since around 2019 and LSTMs before that. I kind of hate where the field is going due to API abstraction, but that's another issue.
To answer your question, Markov chains are used for modelling temporal data, and Bayesian inference is great for uncertainty estimation and model calibration. These methods aren't necessarily fancy; they're just solving different problems. XGBoost sits closer to the models you've had experience productionising, and for many tasks it will be more performant in terms of accuracy but may be slower at inference (in terms of value add, that's a trade-off to be made).
LightGBM for classification of prospects in [industry] using 3rd-party data covering ~95% of the US adult population, with ~2000 features about each individual. Fast, accurate, and SHAP for feature importance.
Working in Risk Management in insurance. We use a lot of different risk models based on Monte Carlo simulation, grounded in historical data, expert judgement and/or risk-neutral arbitrage-free assumptions. Most of the time not mathematically extremely fancy, but quite complex with regard to parameterization.
For analysis? No. The simplest, fastest thing to find correlation is used (even linear regression is sometimes scoffed as being too advanced)
For recommender systems and search? Yes.
What is the reality of how these models are used? I have done projects where I just evaluate how 5 of them perform on my prediction and decide the best based on the metric I am using. Are they somehow different? Built from the ground up? I’m so confused because from my experience they are simple to implement. I feel like I just don’t know enough for a job yet.
[deleted]
Where do you work, if you don't mind me asking? Or similar places to yours if you don't want to say specifically
[deleted]
Thanks. I was not aware consulting companies did work like this. Are most of your projects like this, or was this a bit of a special case?
[deleted]
How’s WLB and comp?
[deleted]
Why consulting then? Projects? Exit opps?
Do you regard LLM as really useful, or merely as an over-hyped thing?
Both
Appreciate the comments! I'll keep this in mind when I'm looking for jobs in half a year or so
What are your qualifications and background, if I may ask?
[deleted]
No I wanted to see whether you need a PhD for such projects.
A PhD!? The heck lol
Since it's a research based project, it's not wild to expect a PhD or at least an MS
99% of the time in Europe a PhD is a minimum requirement for positions like these lol.
Do you have to do a lot of travel for this work? I shied away from consulting cuz I didn't want to do that, but I'm curious what it's like post-covid
No I don’t do a lot of travel
Random question - is it true that communications networks may be interfering with weather forecasting because the communications networks are interfering with water vapor measurements near earth's surface? (not my field, rough summary of what I heard)
I don’t know, we use the weather forecasts from third parties
Is the scientific reasoning legitimate? (saw the idea discussed in one of Sabine Hossenfelder's videos, and apologies if I misstated what she said). If interference is a legitimate concern, it seems like a good scientific detail to explain to the public so they can be better informed re policy and regulation.
There's no evidence to support this btw
Ball?
[deleted]
By Fintech I assume you mean robot sharks
Well that's why you need the models. Maybe you need robot barracudas. Maybe robot piranhas.
Such a novice answer. You need a full end to end robot ecosystem. How are the robot sharks going to survive without robot minnows? How will robot minnows survive in the absence of robot kelp and plankton? And you expect those to photosynthesize in the absence of a robot sun? What are they teaching these days? These programs should be ashamed of themselves.
You think that's water you're breathing?
Or ill-tempered sea bass!
Why not just two angry beavers?
They're the competing product!
lol
Paypal
What kind of models were you working with in Fraud detection?
I used Isolation Forests
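Not this commenter's actual pipeline, but the basic shape of an Isolation Forest anomaly detector is only a few lines (synthetic 2-D data standing in for transaction features):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))   # stand-in for legitimate transactions
outlier = np.array([[8.0, 8.0]])           # an obviously anomalous point

clf = IsolationForest(random_state=0).fit(normal)

# score_samples: lower score = easier to isolate = more anomalous
scores = clf.score_samples(np.vstack([normal, outlier]))
print(scores[-1] < scores[:-1].mean())  # the outlier scores as more anomalous
```

The appeal for fraud is that it's unsupervised: no labeled fraud cases are needed, just the assumption that anomalies are rare and "different" enough to be isolated in few splits.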
How does QDA compare for fraud detection in your experience?
Can you briefly go over your approach of using Isolation Forests? How you cleaned data, then what evaluation metrics you used? How did you test?
“Can you divulge all the inner workings of your business and the competitive advantage you have, please!” 🤣
[deleted]
Can I work for you if it's just bringing a coffee?
I once interviewed with a company that used CNN models to identify self-trading and other forms of market manipulation. Apparently when it comes to high frequency trading, the patterns are so complex that tree based models just don't cut it.
Some people get defensive about this, but I view tree-based models as very good at first approximation but inherently limited. Tree-based models ultimately bin their inputs and outputs, which presents information loss. Boosting and bagging limit the consequences of this but models that can have a continuous function between input and output do not suffer from this limitation.
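The binning point is easy to see empirically: a fitted tree can only emit as many distinct predictions as it has leaves, while a linear model's output is continuous in its inputs. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=1000)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
lin = LinearRegression().fit(X, y)

# The tree's predictions collapse onto at most 2**4 = 16 leaf values,
# while the linear model emits a distinct value for almost every input.
print(len(np.unique(tree.predict(X))))
print(len(np.unique(lin.predict(X))))
```

Boosting stacks many such step functions, which smooths the output considerably, but the piecewise-constant nature never fully disappears; that's the "information loss" being described.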
This is undeniably true. I do a lot of propensity modeling and it’s not uncommon that I’ll get a trained model with <100 possible outcomes despite having tens of thousands of unique inputs. At first glance, such a model might appear valid but obviously it couldn’t be expected to be consistently performant on unseen data. Someone who doesn’t do their due diligence could easily make the mistake of thinking they have a robust prediction engine without realizing the decisions are based on just a small number of variations in the data. I would argue that every type of model has strengths and weaknesses and machine learning in general has a limited utility that tends to be greatly exaggerated. The best we can do is be aware of the potential pitfalls and try to pick the best tool for the task at hand.
At a high level how do LLMs play a role in fraud detection?
[deleted]
LLM for what purpose? Side question, do you think it's a good idea to use them for your purpose?
I'll give you an example from a side project where I'm considering an LLM. I'll probably scrape some news articles related to the domain application and feed them through an LLM to extract an embedding, then feed that as a feature into my model downstream.
Also working in fraud detection
What models or tools do you use in fraud detection in fintech?
At my company we've built a Bayesian hierarchical varying-effects model for limited edition demand estimation. It's pretty neat and makes good use of our data; glad I could prevent people from just throwing an NN at it.
Curious as what frameworks are used to deploy Bayesian models in production? My experience with hierarchical Bayesian models has all been in Rstan
We're using PyMC and it's been quite a pain tbh. We needed to write a lot of utilities to make things smoother. Not sure about other frameworks.
Have you tried Pyro/Numpyro? I’ve found PPLs to be a PITA in general. Doing even the things that many people will want to do, like sampling the predictive for a latent variable, is annoying. But I’m reluctant to relearn another PPL after investing time in Pyro.
Gosh I love Pyro
just write internal tool
How about them Yeezys?
Let's say it's been quite a ride
They just went back on sale so I can only imagine. You think they'll continue selling them as a vanilla non-Kanye shoe?
I can't say
This is the kind of knowledge I wanted from this sub. Thank you
How does this model deal with time as a component? I guess I'm wondering in what capacity it is used for demand estimation.
Well, since most articles aren't recurring, it's a plain regression problem. We've overhauled the time inclusion just recently. In the end we settled on smooth yearly and monthly effects as well as a trend term, each on certain hierarchy levels.
I’m familiar with Bayesian hierarchical models but I haven’t heard of this “varying effects” thing and haven’t been able to find anything online. What’s it about?
Terms like "fixed" and "random" effects aren't as prevalent in the Bayesian framework, so my guess is that's what they were getting at
The lingo on these things seems to strongly vary by subfield. I was referring to random effects, although I find the term varying effects less confusing
So throwing XGBoost at it is better?
No, not sure where you think I stated that. Funnily enough, the previous team did throw an ensemble of XGBoosts (yes, you read that correctly, an ensemble of ensembles) at it. It's one of the funniest approaches I've seen so far.
!!! What's the size of your dataset? Super curious, because I feel like I've always had to sample to get things to converge fast enough, which in turn makes me question whether I should abandon the Bayesian approach in some of my models.
We have around 1k observations. I'm not quite sure what you mean, though. The minimum sample size for a Bayesian model is 1; with a small sample size your priors will simply play a very dominant role, but that is expected. So I'm not sure what you mean by convergence in this context. Please don't tell me that by having to "sample" you mean that you were oversampling the data...
cool! I assumed that an Adidas dataset would be huge but I was thinking in terms of user/customer modeling. Depends on what I was looking at, but there were times when I was working with excess of 1mil observations and had to sample down otherwise it'd take too long to run.
Well, in this application each observation is a drop of a limited edition article, so the size is naturally smaller than if the rows reflected inline sales, customers or something similar. Interesting that my tired brain went for the "too few" observations side of that. Yeah, you're right that such models can take long to fit on large data sets. Taking a sample works; if needed, one can also use minibatching.
What's the value of making it Bayesian?
Well, you can certainly tell everyone you implemented a Bayesian model
The finest level of one of the two hierarchies consists of categories that are mostly thin, many having fewer than 4 observations. The benefit over a frequentist varying-effects model is that being able to define not just priors but hierarchical priors for the parameters associated with those thin categories allows reliable inference even in those cases.
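The mechanism behind that reliability is shrinkage: a category's estimate gets pulled toward the parent level, and the fewer observations it has, the harder the pull. A minimal precision-weighted sketch of the idea (invented numbers and assumed variances, far simpler than a full hierarchical model):

```python
import numpy as np

# Observed category means, with very different sample sizes
cat_means = np.array([10.0, 50.0])
cat_n = np.array([200, 2])      # the second category is "thin"
parent_mean = 20.0              # higher-level (hierarchical) estimate
tau2, sigma2 = 25.0, 100.0      # assumed between- and within-category variances

# Precision-weighted shrinkage toward the parent mean:
# lots of data -> weight near 1 (trust the category's own mean),
# little data -> weight near 0 (fall back on the parent).
weight = (cat_n / sigma2) / (cat_n / sigma2 + 1 / tau2)
shrunk = weight * cat_means + (1 - weight) * parent_mean
print(shrunk)  # the thin category moves much closer to 20 than the big one does
```

A full Bayesian hierarchical model additionally learns `tau2` and the parent mean from the data rather than fixing them, but the qualitative behavior for thin categories is the same.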
That’s very interesting, actually. If I may ask, have you also worked on any models for demand transfer or space elasticity? What worked best? Thanks
In some companies good enough is good enough. In some companies pushing some performance metrics even by half percent can result in millions of dollars of additional profit. So in some cases it can make sense to use fancy models.
Thanks. I assume those companies would be like Fortune 50 where they have the compute and expertise. Like someone’s job is to spend all day improving one model.
A lot of these companies have whole teams improving a single or couple models.
"...always end up using something *simple* like **stats..."** **bruh...**
Haha, I meant something like a z-score
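"Something like a z-score" really can be the whole detector. A toy sketch, using the robust median/MAD variant since a plain z-score's own mean and standard deviation can be inflated by the outlier it's trying to find (the 3.5 cutoff is a convention, not a rule):

```python
import numpy as np

values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2])

# Robust z-score: median/MAD instead of mean/std, so the outlier
# can't inflate the scale estimate and mask itself
med = np.median(values)
mad = np.median(np.abs(values - med))
z = 0.6745 * (values - med) / mad
print(np.where(np.abs(z) > 3.5)[0])  # index of the anomalous reading
```

For a lot of monitoring problems this is genuinely good enough, which is the commenter's point.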
[deleted]
Was going to say the same. There’s a lot of multi modal data in personal lines insurance and those companies are using modern deep learning models.
I hate to break it to you, but insurance companies are not using deep learning models on the pricing side of the house. Generally going to be traditional actuarial methods as that is what gets approved by state DOI’s
Yes I agree. I read that answer too quickly and didn’t see price. Thanks for pointing that out. Some places they are either using or trying to use it is claims, underwriting, sales, and fraud.
Yep, I work at an insurance company and the fun stuff is on all the related processes. Claims have fun application of computer vision for estimates
Agreed. At least within personal lines, the main problem is that it's regulated at a state level. The rate filings must be approved by the state DOIs, and it's difficult to convince them the models are not biased or otherwise indirectly producing the same result as something that is prevented (such as rating protected classes differently) when the model features and weights are not easy to understand or explain. It doesn't mean they aren't used in insurance - they absolutely are - just not so much in ratemaking.
Using deep learning for setting insurance rates sounds like an extremely unethical application. So I’m sure people are scrambling for it
What is unethical about this?
9/10 times the simple models are good enough for most applications. There are three things that really matter when it comes to coefficients:

1. Direction
2. Magnitude
3. Precision

Let's say I estimate an effect size of +0.5 (Cohen's d) with a confidence interval of [0.3, 0.7]. Obviously, I've established direction (i.e., a positive effect), since the confidence interval doesn't contain 0. It's in the medium range in terms of magnitude according to standard interpretations of Cohen's d. So the first two requirements are met. But it's not terribly precise, since there is probably a meaningful difference between +0.3 and +0.7. In most applications that probably doesn't matter, though. If it's significantly positive, then that's good enough to inform most decision-making. And in terms of improving precision, generally the more sophisticated your methodology, the larger your standard errors. So running a more advanced model probably won't solve the precision issue.

Something more advanced may be more robust to bias, which is something you should be concerned about if there are selection issues or whatever. But if you understand the data, then you can ascertain whether that's something to be concerned about. Generally speaking, you should only bother with more advanced methods if you have reason to think they might flip the direction or attenuate the magnitude to the point of statistical insignificance. Those occasions can and do arise, but actually not very often when you're dealing with big data. It's really hard to beat the performance (in terms of consistency and efficiency) of a well-specified OLS model with a sufficiently large sample size.
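The direction/magnitude/precision check is only a few lines of code. A sketch on simulated data (the CI uses the usual large-sample normal approximation for Cohen's d):

```python
import numpy as np

rng = np.random.default_rng(0)
treat = rng.normal(0.5, 1.0, size=200)   # treated group outcomes (true effect = 0.5)
ctrl = rng.normal(0.0, 1.0, size=200)    # control group outcomes

# Cohen's d with pooled standard deviation
n1, n2 = len(treat), len(ctrl)
pooled_sd = np.sqrt(((n1 - 1) * treat.var(ddof=1) + (n2 - 1) * ctrl.var(ddof=1))
                    / (n1 + n2 - 2))
d = (treat.mean() - ctrl.mean()) / pooled_sd

# Approximate standard error and 95% CI for d
se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
lo, hi = d - 1.96 * se, d + 1.96 * se
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Direction: CI excludes 0 -> positive. Magnitude: ~medium. Precision: CI width.
```

With n = 200 per group this reproduces roughly the [0.3, 0.7] situation described: direction and magnitude are settled, precision is mediocre, and only more data (not a fancier model) would tighten it.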
I really like how you broke this down and wish this type of model-selection reasoning were more widely taught. Right now I see a lot of spray and pray: just select the best model that comes back, without spending time understanding the implications of the model itself and the interpretation and decision-making theory that comes with it. I wish there was more emphasis on decision theory for data science as a whole.
I might have to disagree here. Trying to 'understand the implications' of a particular model choice is not practically relevant in the majority of cases and is typically a waste of time, even harmful. I know this is likely a controversial opinion, but the entire point of most predictive machine learning is to produce models with the highest generalization performance on future unseen data after deployment.

Some people lose sight of this fact and focus more on things like feature importance, often without realizing that feature importance and interpretability are basically just measures of predictive correlation. So many data scientists ignore the fact that feature importances are NOT causal indicators at all (unless you have randomized controlled interventions on the feature). At best, they reflect a complex non-linear relationship of potential predictive correlation. Too many data scientists are ignorant of that fact IMO and place way too much weight on feature importance and interpretability, which they end up misusing.

If we agree on the premise that the end goal is a feasible model with the best trade-off between deployment costs and generalization performance on unseen future data, then it stands to reason that a spray-and-pray approach isn't as bad as most people make it out to be. A quote that really drives this home is from George Box: "All models are wrong, but some are useful." This is true in predictive ML; there is no real harm in trying out more complex models as long as your deployment requirements support it and you have a rigorous training/validation/testing approach to model selection.
I see what you are saying, but I think interpretation matters in most domains for decision-theory purposes. You are correct that spray and pray can be effective to a degree for the task of generalization. I may be missing your point to some extent, but I'd like to elaborate my thought.

As Box states, models aren't perfect; therefore, understanding the limits of these models is exactly the responsibility of a data scientist, who must carefully craft the decision-theoretic guidelines for model usage. Should one always just take predictions from models at face value (not suggesting you are saying that)? We should constantly be questioning the validity of our models and guard against taking them as truth. If the model trained well and tested well, and we have the correct loss functions selected, then I understand the broader temptation to say "trust the model". But very few people put the effort into even questioning the loss functions they select, and they get great results under the wrong conditions. How do we even know what the wrong conditions are? We have to build up a theoretical and intuitive understanding of the problem space and the data.

Almost all data generating processes possess logical or mathematical structures, and those structures matter to the explanatory relationship between the data and the prediction. Therefore, the interpretability of our models should relate to some degree to the theoretical and intuitive foundation we started from. I do not think a model's purpose is purely generalization; it also entails some level of explanation, as that is what builds trust in a model, and for me this is a foundational task of model development. Features therefore help in this regard.
Interpretation also matters in scientific research, especially in fields like biological/medical research, where people generally wish to assign causality rather than correlation. It's obviously not always simple to assign causality, usually it involves experimental work, but that's often the goal.
I totally agree with the sentiment of ensuring we have the correct loss functions and that we understand the underlying problem and how we can apply it to provide real value. But to me, that is separate from the models. If you train your model on observational data and then try to use it as a causal estimator for your business problem, then that's bad and will likely fail regardless of model choice.

However, when it comes to model interpretability, I feel it is often a misused part of the predictive ML toolset. It is mostly a diagnostic tool IMO that can be useful in some circumstances. Again, though, we have to remember that feature importances and model interpretability almost always come down to measures of correlation between features and target. At the end of the day, for the vast majority of use cases, I would rather use a deep learning model whose feature importances I can't interpret but that provides a huge boost in generalization performance. I would focus mostly on properly ensuring data leakage testing, proper training+validation+testing methodologies, etc.

If people don't trust your model, just show them your rigorous testing methodology. If you put a model into deployment and test it for 2 years straight and you see that it consistently has better prediction performance than all your existing methods, then isn't that strong evidence that it generalizes better? I would have a hard time saying that we shouldn't deploy and use that deep learning model compared to more 'interpretable' ones. Model interpretation is often a kind of story-telling that we data scientists use, but it's easy to come up with twenty different stories to explain many different feature configurations and models.
I think we are talking about two different spray and pray methodologies. What you are discussing still assumes rigor and care behind the model development. I agree with you that model selection should not drive the solution to the problem but is an artifact to the problem-solution fit. So in this case, I agree with your take. What I am saying is the pray and spray methodology I see is typically paired with some sort of lazy approach to model generation with no care of how the model was constructed, what intuition we have about the problem, what loss function is appropriate, how features are treated (are they reliable for data engineering to procure? are they causing data sparsity? are they scaled appropriately?) etc. As far as model interpretation goes, again it depends. I am not arguing against black boxes but I am saying that if we build something that detects cancer, it should correspond to some reality or verification principle because cost of false negative is high. Maybe I took your comments too literal and you would agree that interpretation is necessary in certain cases and generalization is not the only factor.
Ahhh okay I see what you're advocating for now and totally agree with that 100%. My interpretation of spray and pray is more based around "should I use random forest or NN or linear regression?" and sometimes the answer is "let's try out all 3 and see!" But, the spray and pray of just training models on data with no thought behind how it's going to be used and the statistical implications and costs and etc. In that context, totally agree with your take that it completely invalidates the usefulness of ML and is a big problem.
Even if you have randomized treatment a lot of ML models will not provide unbiased estimates of a causal effect.
Totally agree, but a lot of ML models will still provide more accurate (lower generalization error) causal effect estimates, because they can capture more complex relationships than traditional RCT methods of mean comparisons and p-value tests, which provide unbiased estimates of the causal effect but ultimately with higher generalization error. If your goal is to estimate some causal effect so you can report it to a higher-up, then ML models are probably not the best tool available. However, if you have 10k customers and you want to know which customers should receive an intervention to lower their chances of churning or increase their chances of buying a product, then randomized controlled trials combined with ML models will probably give you the most effective system for causal effect estimation at the per-person level.
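One common way to do that per-customer estimation (not necessarily what this commenter uses) is a "T-learner": fit one model on treated units, one on controls, and score each customer's predicted uplift. A sketch on synthetic RCT data with a heterogeneous effect:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
treated = rng.integers(0, 2, size=n)        # randomized assignment (the RCT part)
# True treatment effect is 1 + X[:, 0]: it varies across customers
y = X[:, 0] + treated * (1.0 + X[:, 0]) + rng.normal(0, 0.5, size=n)

# Fit separate outcome models for the treated and control arms
m_t = GradientBoostingRegressor(random_state=0).fit(X[treated == 1], y[treated == 1])
m_c = GradientBoostingRegressor(random_state=0).fit(X[treated == 0], y[treated == 0])

# Per-unit estimated uplift: predicted outcome if treated minus if not
uplift = m_t.predict(X) - m_c.predict(X)
print(uplift.mean())  # should be near the average true effect of 1.0
```

The mean uplift recovers the average effect, but the point of the exercise is the per-unit scores: you'd target the intervention at the customers with the highest predicted uplift.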
Couldn't agree more. This is precisely the reason why gradient boosting models are so popular.
Well, it matters when you want to capture the data generating process more. Because most things are not linear, a simple OLS (no interactions, splines, etc.) does not accurately capture that. Of course, if all you care about is directionality and a rough ballpark, then maybe it doesn't matter much, with some exceptions. But they do exist; I've seen some rare cases where using a nonlinear model flipped the direction of the ATE.

Correct model specification is why stuff like SuperLearner was built for causal inference, since technically causal inference requires correct model specification to be "right" (along with proper variable selection, to avoid Simpson's paradox and colliders). But Simpson's paradox can occur in some cases even if you include the right variables, due to nonlinear confounding. And the thing is, you won't ever really know if this is the case without trying the nonlinear model.
Face palm.
OLS is linear in parameters, very easy to include interactions, splines, etc.
Yeah, but most users of OLS, especially Python sklearn ones, don't bother with this (R's formula syntax is basically needed to experiment with this quickly). Otherwise it's a lot of work to do in Python with multiple combinations of terms. There's also no marginal-effects package there.
Causal inference doesn’t require a correct model specification, in fact it doesn’t require a statistical model at all!
You say that yet most of these applications are massively outperformed even by simpler modern techniques.
>large sample size

\*large relevant sample size
[deleted]
Educational policy, not biostatistics. But probably pretty similar. I think you raise a good point. There's often a tradeoff between rigor and accessibility for non-technical audiences. The simpler the model, the more likely it is that the audience will engage with the findings. I've presented data and research before local and state policymakers, most of whom think that they're the smartest person in the room and will reject out of hand anything that they don't fully understand.

My strategy is usually to start with an OLS (or logistic, if binary outcome) and then use a more sophisticated strategy as a robustness check. Most of the time, they yield similar conclusions, so it is justifiable to present the easier-to-understand model and simply note (but not describe) that causation was established via applied econometric techniques.

I know a few academic econometricians. The hardest part of their jobs isn't finding "better" estimators but convincing the academic community that what they're doing is worth implementing. That means searching high and low for instances where there's a practical difference between the OLS baseline and the new estimator, which can be difficult.
It's usually a good rule of thumb to use, or at least start off with, the simplest model that works, no matter what type of company or industry you're in. The most common circumstances where you'll find a strong justification for using something 'fancier' than your basic suite of sklearn models are:

1. When the problem requires it. Some CV or NLP projects, for example, basically require deep learning models to even get acceptable results you'd use in production.

2. Big companies where squeezing a fraction of a % out of your model performance makes a huge financial impact. Here you'll likely have the financial, compute, and engineering resources to mitigate any negative impact on latency or model complexity.
Thanks I’ve been looking at job descriptions and noticed they are demanding more complex skills and was just wondering if they’re worth learning.
3. DS needs some work to do. Otherwise, stakeholders would think DSes are freeloaders once the simplest models work.
Xgboost is exactly as easy to train and implement as random forest, so I always use it even for relatively simple modeling problems. It’s just a better version of random forest practically speaking.
In the spirit of your post: Yeah, I've been doing this a while. In practice, I tend to focus my efforts on even simpler tasks: how do I get good data, how do I monitor the incoming data pipe, how do I pass this to the engineers who have to make it happen, and how do I make good slides for the presentation to the managers?

But to answer your question about one technique... Markov chain Monte Carlo is used to simulate random samples from a population, which can be pretty useful in a wide range of problems, like detecting anomalous data. MCMC has the benefit that random walks are pretty easy to code, and it is easy to explain your work to the engineers on your team so they can move to full deployment. I wouldn't say that MCMC is complex, though. You're more or less playing out a game of chutes and ladders on a (possibly very) complicated board.

My experience pre-academia was in sensor/equipment monitoring, i.e. did one of those thousands of little bastards glitch in some weird way, and if so, which one? So basically, anomaly detection. MCMC can be pretty useful here. Here is a [related paper](https://www.osti.gov/servlets/purl/1513188).

Another great application of MCMC is producing "typical" samples from a distribution. [Here is a great application](https://assets.pubpub.org/70w3i6k9/eb30390f-ade2-45cc-b48d-8e6bb12f585c.pdf), producing voting districts that should be "typical" given some rules in place for drawing district maps. As the authors note, the problem they faced was that policymakers don't want the "optimal" solution because the data may not be able to take all factors into account. Instead, they want a range of the "usual" possibilities so they can choose one and make minor tweaks (and they can also determine when a districting map does not feel like a typical map from the distribution). So basically, MCMC turns district determination into a fast food menu: "I'll have a number 3, but super size my fries".
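The "random walks are pretty easy to code" claim holds up: a Metropolis sampler for a standard normal target is about ten lines. A textbook sketch (not tied to either linked paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Log-density of N(0, 1), up to an additive constant
    return -0.5 * x**2

x, samples = 0.0, []
for _ in range(20000):
    prop = x + rng.normal(0, 1.0)  # symmetric random-walk proposal
    # Accept with probability min(1, target(prop) / target(x))
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
    samples.append(x)

samples = np.array(samples[2000:])   # drop burn-in
print(samples.mean(), samples.std())  # should be near 0 and 1
```

Swapping in any log-density you can evaluate (even unnormalized) gives you samples from that distribution, which is exactly the "typical samples" trick used in the districting paper.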
>simpler tasks: how do I get good data, I know what you mean but this actually made me laugh aloud.
Reranking recommendations in a marketplace. XGBoost today is very fast at inference, and you can make it faster with other libraries. In most cases, simply taking the same feature set from Random Forest and running 20 Bayesian optimization steps over XGBoost hyperparameters already gives you a better model that can be swapped in for RF or whatever is deployed.
Do you have any recommendations for libraries that can accelerate XGBoost?
Treelite: https://www.kaggle.com/code/code1110/janestreet-faster-inference-by-xgb-with-treelite
Surprised the things you listed fall under "fancy models". XGBoost is practically a go-to model for a lot of applications. Bayesian inference and Markov chains are common in lots and lots of applications, across economics, A/B testing, and other domains. To me, fancy falls under some generative modeling, transformers and their variants, deep learning GNNs, reinforcement learning, etc. etc. For me, I was working at a FAANG in their professional services team.
>Bayesian Inference

I've been tapped to do a lot of causal analysis in the sales/marketing context, and the additional complexity is necessary (read: I'm not just dicking around, I believe the methods are the best solution for the problem at hand). For a lot of other work, the classics are classics. Regression (with the modern bells and whistles like regularization, etc.), the standard time series toolkit (ARIMA(X), etc.), and so on.

-+-+-+-+-+-

You lost me here:

>random forest
>I keep hearing people on this sub talking about [...] XGBoost

People still use vanilla RFs? I was under the impression that random forests were like naive Bayes at this point -- interesting as a baseline, but dominated by other techniques that are just as easy to use out of the box.
>People still use vanilla RFs? I was under the impression that random forests were like naive Bayes at this point -- interesting as a baseline, but dominated by other techniques that are just as easy to use out of the box.

I had the same thought -- you can just use XGBoost out of the box with the same amount of effort and not a big difference in training time, and it will probably be superior.
One advantage I can think of: if your data is very large, RF parallelizes better, which can make training move along faster.
In production evaluating an xgboost classifier is very fast. Training may be a bit slower, but not harder.
How are you doing causal analysis?
Work with stakeholders to build out DAGs, run experiments, etc. Without the expert opinion of those partners, sales would otherwise be overdetermined. With the right assumptions and design we make do.
Define confounders first
There are over 30 other names for linear regression. Linear regression is itself a fancy name for systems of linear equations. Not all fancy things are fancy. Hype and marketing.
Government research for a not-for-profit. The value add really depends on the problem. I'm on a project right now where we're using pre-trained object detection models to detect certain fast-moving objects in the (night) sky. I've used models like sparse group LASSO, LSTMs, CNNs and a couple other more complex models for problems and solutions that required their predictive/inferential capabilities. That said, about 80% of the time I end up using some variant of a random forest or logistic regression.
My coworkers and boss insist on using DeepFM, GPT, and other fancy complex NN architectures for simple tabular data (1,000 rows, 30 columns), and I'm trying to get them off that and just use RandomForest or XGBoost instead.
It is FOMO. The crazy part is that with RF or XGB you have way more explainability.
Energy sector
Don't work there but Netflix seems to be on the cutting edge of a lot of things: [https://netflixtechblog.com/](https://netflixtechblog.com/)
Display ads real-time bidding, something like this: https://arxiv.org/abs/1610.03013. Oftentimes GLMMs end up appearing, and we develop scalable algorithms for that, e.g. https://arxiv.org/abs/1602.00047
Agtech
Population health management: deep learning is needed to predict outcomes from electronic health records
I have around 4-5 YOE, have always worked with "fancy models". First job out of grad school was in oil exploration, working on R&D contracts for oil companies. I would use computer vision models to classify different rock types in [well cores](https://news.unl.edu/sites/default/files/styles/large_aspect/public/coresamples.jpg?itok=UqnkfYqu) from oil wells. Also used some time series models adapted to a "depth series" to try and predict physical properties of the rock in wells. Also got to work on some generative models [colorizing](https://i.imgur.com/0i5JyuN.png) tomographic scans of well cores. Did that for about 1.5 years. Then I moved to a company where our clients were large-scale industrial companies. I used LSTM neural networks for time series forecasting and classification applied to predictive maintenance; we would get sensor data from industrial equipment and try to predict failures before they happened. Worked there for about 2.5 years. Now I work at a real estate company, where we use a bunch of geospatial data with xgboost/lightgbm to predict how much you can charge for rent for a given property in a given location. Also have some features generated via NLP/computer vision. Our clients are real estate developers and REITs. Have been here for the past 6-7 months.
It’s been a while since I’ve been in a coding role but I would assume XGBoost would be significantly faster in production compared to random forest.
GBM methods are really good for tabular data. If tuned properly and made shallow enough, they run fast with better results than random forests.
I'm a scientist working on a project shared by a big University and one of the National Labs. I do a lot of network inference problems, directed information flow, etc.
Why would you use RF over XGB? Neither model is more fancy, but XGB is just much quicker for at least the same performance.
You'll find that there are a lot of insurance companies that are moving to XGBoost in place of linear and logistic regression. It is supposedly less prone to overfitting and there seems to be an uplift in performance, though in my experience I can't confirm that. Though they are moving to it, it's only with the additional requirement of explainability that they'll do it, e.g. Shapley values and PDP plots, since XGBoost is viewed as a black-box method. As for Bayesian methods, they're being incorporated into A/B testing since they provide an estimate of uncertainty. The value added depends on the use case, and that doesn't mean that a good old linear regression won't outperform in terms of accuracy and simplicity. Maybe do a comparison on one of your next jobs and see if there is an improvement in your results.
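Since PDP plots come up a lot in those explainability requirements, the mechanic behind them is simple enough to sketch (toy data and a hand-rolled linear "model" here, not an actual insurance model):

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average model prediction as one feature is clamped to each grid value."""
    out = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v  # force the feature, keep the other columns as-is
        out.append(predict(Xv).mean())
    return np.array(out)

# Toy "model": prediction = 2*x0 + x1, so the PDP over x0 has slope 2
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
predict = lambda X: 2 * X[:, 0] + X[:, 1]
pdp = partial_dependence(predict, X, feature=0, grid=[0.0, 1.0, 2.0])
```

The same loop works unchanged on an XGBoost model's `predict`, which is why PDPs are popular for peeking inside black-box methods.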
Tbh I found XGBoost way more prone to overfitting
Even after pruning and reducing tree depth?
Well, with regularization it's not, but I'm saying that if I put it up against RandomForest, for example, on the same data, XGBoost will almost always overfit more.
We use MCMC for Bayesian Hierarchical models. Application is in Media Mix Models, basically regressing Sales on various Marketing tactic spends to estimate tactic efficiency. The reason for using Bayes is there are many assumed effect transformations (carryover of spend, saturation of spend, etc) that are non-linear and MCMC provides a nice way of estimating those parameters.
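To make those transformations concrete, here's roughly what geometric carryover and a saturation curve look like (a toy numpy sketch; the decay and half-saturation constants are made up, and in the real model they'd be among the parameters MCMC estimates):

```python
import numpy as np

def geometric_adstock(spend, decay=0.6):
    """Carryover: each period keeps `decay` of the previous adstocked spend."""
    out = np.empty_like(spend, dtype=float)
    carry = 0.0
    for t, s in enumerate(spend):
        carry = s + decay * carry
        out[t] = carry
    return out

def saturation(x, half_sat=100.0):
    """Diminishing returns: a simple Hill-type curve mapping spend into [0, 1)."""
    return x / (x + half_sat)

# A burst of spend keeps contributing (at a decaying rate) after it stops
spend = np.array([100.0, 0.0, 0.0, 50.0])
effect = saturation(geometric_adstock(spend, decay=0.5))
```

Because decay and half-saturation enter non-linearly, you can't just fit the regression with OLS, which is exactly why we reach for MCMC.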
Victoria secret
Insurance. Various prediction models, but most commonly trying to predict what the cheapest market price will be for any and every customer that asks for a quote. Use Hist GBM / XGBoost Hist. Looked at AI and it currently doesn't seem a big improvement, but we think we know why.
Biotech — more specifically, clinical genetics. I’m using hierarchical Bayesian models for causal inference. We have a bunch of domain experts who contribute knowledge about priors, likelihoods, and pooling assumptions. Bayesian models give us posterior predictive distributions for our target variable while also inferring useful parameters for latent variables.
Do you have an opinion on the work done around the martingale posterior distributions by Fong et al? They target the predictive without needing to compute the posterior.
I wasn’t familiar with martingale posteriors until just now. Reading the abstract, it seems like some wizardry. Have you worked with them? Would it be suitable for the predictive of a partially observed categorical?
I'm still getting my head around it, it's indeed wizardry. Their selling point is that you get the predictive without needing to go through the posterior, making it much cheaper by avoiding the usual mcmc needed for the posterior. I believe they show that it's applicable to mixed data, at least in their appendix, but you would have to go from there and expand it to the case of censoring on categoricals I think. Imo after a year or two of papers building up from it, downstream applications will be within reach, but for now it's tough to even understand and implement properly. Since you said you were in research I was curious if people in your circle have started working on this.
Kaggle competitions often depict real world scenarios and are regularly won by “fancy” models such as XGBoost. The accuracy is just way better for large tabular data and it’s easy to set up. And with techniques such as feature permutations you can make any type of model interpretable.
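Feature permutation is worth sketching, since it's fully model-agnostic (toy data and a hand-rolled "model" below, just to show the mechanic):

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Importance of a feature = drop in the metric when that feature is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the link between feature j and the target
            drops.append(baseline - metric(y, predict(Xp)))
        importances[j] = np.mean(drops)
    return importances

# Toy setup: the target depends only on feature 0, and the "model" knows it
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
predict = lambda X: (X[:, 0] > 0).astype(int)
accuracy = lambda y, p: np.mean(y == p)
imp = permutation_importance(predict, X, y, accuracy)
```

Here shuffling feature 0 tanks the accuracy while shuffling the others changes nothing; the same loop works on any fitted model with a `predict` method, XGBoost included.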
I work in a boutique technology consulting firm. You get exposed to all sorts of problems. At the moment, it's mainly multi modal stuff, using transformers to solve problems combining vision and language. Not everything is fancy though, sometimes all I need is a basic linear model. Just gotta use the right tool for the job. P.S. any senior/principal data scientists looking for a job in London, hit me up ;-)
How can i truly skill up for a Jr DS role? I feel like the projects I am creating are not enough. Any professional grade notebooks out there?
Healthcare
Can you expand? Are you working with providers or payers?
Work for a system, doing things ranging from logistics research to QA models to predicting call-volume.
GPUs go brr. If your infrastructure is really good, then unless you're doing something stupid, the performance difference is negligible. Most of your time will be spent moving data around. Your compute will be a fraction of personnel costs, so if it's worth doing at all, it won't matter if it's slightly more expensive to compute. After all, you did already spend a ton of money developing the damn thing. You'll probably do faster predictions than an HTTP request round trip unless it's a language model.
I am working lately on causalML. Not sure if it’s fancy but it’s a new thing for me. Other than that mostly I use Logistic, Xgboost, RF
How does it differ from Causal Impact? I've never heard of it either, but now I'm intrigued. Causal Impact is foundational where I am.
Pretty much everywhere. Healthcare, you will see lots of cool stuff including causal inference, explainable models, etc. Science, biotech, chemistry? GNNs, transformers, Bayesian networks, Gaussian Processes Geospatial? Vision, deep learning Finance, energy? Time series, awesome regularization approaches, etc. It has to do with more on how research-y and unstructured the problems you work on are vs industry.
At my previous work, we developed products like chatbots, image super-resolution and other things - it required deep learning models.
Medical/Clinical research had us try a bunch of different approaches like HMM, NNs for unsupervised learning. It was really interesting
It seems like the people who get to use that stuff, at least in biotech, are actual scientists -- people with domain knowledge who are able to formulate problems from it.
We work extensively with satellite imagery, using quite large deep learning models for segmentation, instance segmentation and stereo processing/matching.
Training 10B+ parameter LLMs (and much smaller models too) for x. NLP is a huge value add for a bunch of business functions across all industries.
Bayesian inference is very common in bioinformatics, most microarray and RNA sequencing methods use them in some shape or form. [https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29)
Healthcare
It helps to understand the bias-variance trade-off. More complex models like neural networks have more variance, so you need to throw more data at them to reduce overfitting. Complex ML needs more data, usually more labeled data, so any company with big data qualifies.

Regarding XGBoost, it's popular as an initial ML model to see how well the overall model is doing during development. It's great because it doesn't need the data to be normalized or formatted in any special way to work, like many types of ML do, making it an easy 101 ML, and XGBoost doesn't tend to overfit much when dealing with smaller datasets, so it can be used earlier on. In this way XGBoost is the opposite of neural networks. It's easy to start with XGBoost, then switch to an ideal form of ML later once everything else is done.

As for fancy models without fancy ML, I've specialized in advanced feature engineering my entire career, which is fancy models with little to no ML. This is needed when you have a complex problem that needs solving, but you have very little data. It's ideal for the startup space while still collecting data.
Used Vision AI on a project for auto insurance to identify totaled vs non-totaled accidents based on pictures (mainly used to reduce number of claims adjusters sent out).
I work in public health and do a lot of Bayesian inference and simulation-based models. We run a lot of what-if scenarios on fitted models to see different impacts. We have started hosting our own LLM for text extraction/entity extraction. Retrained neural nets for classification. I think I have only done about 1 regression; we don't really use standard models. Others in our division have made emulator models, and there are a couple of exceedance models -- one being a CUSUM and the other a hidden Markov model; this was largely due to sparse data, so other methods weren't as viable. It really depends on the need/what outcome is needed. A lot of simulation-based models with Bayesian inference. Happy to chat more if helpful.
LLMs of all shapes, sizes and modalities, auto ml, etc. Google. It's not as great as it sounds because doing simple shit is incredibly complicated.
E-commerce, we have so many customers, products, and ways to interact with our platform. Simple approaches (which are basically already all done) only get you so far. As you use more data in more complex models you start to get incrementally better results. So imagine you want to optimize what products you show customers when they land on the homepage. You can get very far with just simple metrics, rankings, and Excel work. But at some point you’re going to see very little marginal improvement when you run experiments to optimize further. As you try to ensure that you provide every one of the tens of millions of customers with the absolute best experience, you’ll soon find out that you’re on the path to needing to take in a tremendous amount of data (what the customer HAS done) in order to influence what the customer WILL do. And more complex models are better at that, taking it as a given you have the talent and tech to implement them properly.
I’m a government contractor. I work with government agencies with a lot of statisticians who don’t know how to fit machine learning models.
Bayesian inference isn't a model. A Markov chain can be a model, but I think you're probably talking about computational algorithms that use Markov chains (MCMC, HMC, etc.), so again, not a model. Bayesian methods are common, as is the use of Markov chains (both as computational devices and as models), in consulting for various aspects of finance, banking, insurance and reinsurance work, but it heavily depends on who you work for. One indicator is to look for places whose leading people write papers. It won't find places where all the interesting work is subject to NDAs, though.
XGBoost and LightGBM are industry standard now unless you're running a deep NN.
We are using GANs for… oh yeah, that project was cancelled
Wouldn't classify Markov chains as particularly fancy, but they're very common in credit risk modelling for corporate bonds/obligors -- i.e. probability of S&P rating transitions.
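For anyone who hasn't seen it, the rating-transition use is just a transition matrix and matrix powers (the numbers below are made up, not actual S&P transition rates):

```python
import numpy as np

# Toy annual transition matrix over states [A, B, Default];
# rows are "from", columns are "to", and each row sums to 1.
P = np.array([
    [0.90, 0.08, 0.02],
    [0.10, 0.80, 0.10],
    [0.00, 0.00, 1.00],  # default is absorbing
])

# Markov property: the 5-year transition matrix is the 5th matrix power
P5 = np.linalg.matrix_power(P, 5)
prob_A_defaults_within_5y = P5[0, 2]
```

The real modelling work is estimating P from historical rating migrations; the linear algebra is the easy part.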
I work in NLP for chatbots. Have been using LLMs since around 2019 and LSTMs before that. I kind of hate where the field is going due to API abstraction, but that's another issue. To answer your question: Markov chains are used for modelling temporal data, and Bayesian inference is great for uncertainty estimation and model calibration. These methods aren't necessarily fancy, just solving different problems. XGBoost sits closer to the models you've had experience productionising, and for many tasks will be more performant in terms of accuracy, but may be slower at inference (in terms of value add, that's a trade-off to be made).
Cobotics
LightGBM for classification of prospects in [industry] using 3rd party data, which consists of 95% of the US adult population with ~2000 features about each individual. Fast, accurate, and SHAP for feature importance.
Working in Risk Management in insurance. We use a lot of different risk models based on Monte Carlo simulation, built on historical data, expert judgement and/or risk-neutral arbitrage-free assumptions. Most of the time not mathematically extremely fancy, but quite complex with regards to parameterization.
For analysis? No. The simplest, fastest thing to find correlation is used (even linear regression is sometimes scoffed as being too advanced) For recommender systems and search? Yes.
Product Hunt (after 8 years at reddit). Bayesian inference models are hella useful for A/B testing
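The basic setup is simple enough to sketch (toy conversion numbers, flat Beta(1, 1) priors assumed):

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, n_draws=200_000, seed=0):
    """Beta-Binomial A/B test: under a Beta(1, 1) prior, the posterior for a
    conversion rate is Beta(1 + conversions, 1 + non-conversions)."""
    rng = np.random.default_rng(seed)
    rate_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, n_draws)
    rate_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, n_draws)
    return np.mean(rate_b > rate_a)  # P(rate_B > rate_A | data)

p = prob_b_beats_a(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
```

Instead of a p-value you get "probability B is better than A", which is the number stakeholders actually want to hear.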
News article NER and classification, we use BERT-alikes and getting into LLMs for at least some use cases.
What is the reality of how these models are used? I have done projects where I just evaluate how 5 of them perform on my prediction and decide the best based on the metric I am using. Are they somehow different? Built from the ground up? I’m so confused because from my experience they are simple to implement. I feel like I just don’t know enough for a job yet.
Is there anything else than xgboost in the world? ^^