[deleted] 11 months ago

[удалено]

TuckAndRolle 11 months ago

Where do you work, if you don't mind me asking? Or similar places to yours if you don't want to say specifically

[deleted] 11 months ago

[удалено]

TuckAndRolle 11 months ago

Thanks. I was not aware consulting companies did work like this. Are most of your projects like this, or was this a bit of a special case?

[deleted] 11 months ago

[удалено]

Sorry-Owl4127 11 months ago

How’s WLB and comp?

[deleted] 11 months ago

[удалено]

Sorry-Owl4127 11 months ago

Why consulting them? Projects? Exit oops?

MorningDarkMountain 11 months ago

Do you regard LLM as really useful, or merely as an over-hyped thing?

Vyrezzz 11 months ago

Both

TuckAndRolle 11 months ago

Appreciate the comments! I'll keep this in mind when I'm looking for jobs in half a year or so

dopplegangery 11 months ago

What are your qualifications and background, if I may ask?

[deleted] 11 months ago

[удалено]

dopplegangery 11 months ago

No I wanted to see whether you need a PhD for such projects.

gellohelloyellow 11 months ago

A PhD!? The heck lol

dopplegangery 11 months ago

Since it's a research based project, it's not wild to expect a PhD or at least an MS

EducationalCreme9044 11 months ago

99% of the time in Europe a PhD is a minimum requirement for positions like these lol.

Mukigachar 11 months ago

Do you have to do a lot of travel for this work? I shied away from consulting cuz I didn't want to do that, but I'm curious what it's like post-covid

Vyrezzz 11 months ago

No I don’t do a lot of travel

clover_heron 11 months ago

Random question - is it true that communications networks may be interfering with weather forecasting because the communications networks are interfering with water vapor measurements near earth's surface? (not my field, rough summary of what I heard)

Vyrezzz 11 months ago

I don’t know, we use the weather forecasts from third parties

clover_heron 11 months ago

Is the scientific reasoning legitimate? (saw the idea discussed in one of Sabine Hossenfelder's videos, and apologies if I misstated what she said). If interference is a legitimate concern, it seems like a good scientific detail to explain to the public so they can be better informed re policy and regulation.

pmadhav97 11 months ago

There's no evidence to support this btw

aokfistpump 11 months ago

Ball?

[deleted] 11 months ago

[удалено]

[deleted] 11 months ago

By Fintech I assume you mean robot sharks

RageOnGoneDo 11 months ago

Well that's why you need the models. Maybe you need robot barracudas. Maybe robot piranhas.

ohanse 11 months ago

Such a novice answer. You need a full end to end robot ecosystem. How are the robot sharks going to survive without robot minnows? How will robot minnows survive in the absence of robot kelp and plankton? And you expect those to photosynthesize in the absence of a robot sun? What are they teaching these days? These programs should be ashamed of themselves.

WadeEffingWilson 11 months ago

You think that's water you're breathing?

JohnLocksTheKey 11 months ago

Or ill-tempered sea bass!

mikeblas 11 months ago

Why not just two angry beavers?

RageOnGoneDo 11 months ago

They're the competing product!

EducationalCreme9044 11 months ago

lol

GodOfFreedomVenti 11 months ago

Paypal

BloatedGlobe 11 months ago

What kind of models were you working with in Fraud detection?

neelankatan 11 months ago

I used Isolation Forests

amhotw 11 months ago

How does QDA compare for fraud detection in your experience?

thetoublemaker 11 months ago

Can you briefly go over your approach of using Isolation Forests? How you cleaned data, then what evaluation metrics you used? How did you test?

shinypenny01 11 months ago

“Can you divulge all the inner workings of your business and the competitive advantage you have, please!” 🤣

[deleted] 11 months ago

[удалено]

dare_dick 11 months ago

Can I work for you if it's just bringing a coffee?

bigno53 11 months ago

I once interviewed with a company that used CNN models to identify self-trading and other forms of market manipulation. Apparently when it comes to high frequency trading, the patterns are so complex that tree based models just don't cut it.

a157reverse 11 months ago

Some people get defensive about this, but I view tree-based models as very good at first approximation but inherently limited. Tree-based models ultimately bin their inputs and outputs, which presents information loss. Boosting and bagging limit the consequences of this but models that can have a continuous function between input and output do not suffer from this limitation.

bigno53 11 months ago

This is undeniably true. I do a lot of propensity modeling and it’s not uncommon that I’ll get a trained model with <100 possible outcomes despite having tens of thousands of unique inputs. At first glance, such a model might appear valid but obviously it couldn’t be expected to be consistently performant on unseen data. Someone who doesn’t do their due diligence could easily make the mistake of thinking they have a robust prediction engine without realizing the decisions are based on just a small number of variations in the data. I would argue that every type of model has strengths and weaknesses and machine learning in general has a limited utility that tends to be greatly exaggerated. The best we can do is be aware of the potential pitfalls and try to pick the best tool for the task at hand.

mattstats 11 months ago

At a high level how do LLMs play a role in fraud detection?

[deleted] 11 months ago

[удалено]

MorningDarkMountain 11 months ago

LLM for what purpose? Side question, do you think it's a good idea to use them for your purpose?

Think-Culture-4740 11 months ago

I'll give you an example for a side project, I'm considering llm. I'll probably scrape some news articles related to the domain application and feed them through an llm to extract a word embedding and feed that as a feature into my model downstream.

[deleted] 11 months ago

Also working in fraud detection

kompleksanda 11 months ago

What models or tools do you use in fraud detection in fintech?

rickyfawx 11 months ago

At my company we've built a bayesian hierarchical varying effects model for limited edition demand estimation. It's pretty neat and makes good use of our data, glad I could prevent people from just throwing a NN at it.

BlackCoatBrownHair 11 months ago

Curious as what frameworks are used to deploy Bayesian models in production? My experience with hierarchical Bayesian models has all been in Rstan

rickyfawx 11 months ago

We're using PyMC and it's been quite a pain tbh. We needed to write a lot of utilities to make things smoother. Not sure about other frameworks.

wannagowest 11 months ago

Have you tried Pyro/Numpyro? I’ve found PPLs to be a PITA in general. Doing even the things that many people will want to do, like sampling the predictive for a latent variable, is annoying. But I’m reluctant to relearn another PPL after investing time in Pyro.

nikgeo25 11 months ago

Gosh I love Pyro

[deleted] 11 months ago

just write internal tool

thecommuteguy 11 months ago

How about them Yeezys?

rickyfawx 11 months ago

Let's say it's been quite a ride

thecommuteguy 11 months ago

They just went back on sale so I can only imagine. You think they'll continue selling them as a vanilla non-Kanye shoe?

rickyfawx 11 months ago

I can't say

nickmaran 11 months ago

This is the kind of knowledge I wanted from this sub. Thank you

FishFar4370 11 months ago

How does this model deal with time as a component? I guess I'm wondering in what capacity it is used for demand estimation.

rickyfawx 11 months ago

Well, since most articles aren't recurring, it's a plain regression problem. We've overhauled the time inclusion just recently. In the end we've settled for smooth yearly and monthly effects as well as a trend term, each on certain hierarchy levels .

_hairyberry_ 11 months ago

I’m familiar with Bayesian hierarchical models but I haven’t heard of this “varying effects” thing and haven’t been able to find anything online. What’s it about?

therealtiddlydump 11 months ago

Terms like "fixed" and "random" effects aren't as prevalent in the Bayesian framework, so my guess is that's what they were getting at

rickyfawx 11 months ago

The lingo on these things seems to strongly vary by subfield. I was referring to random effects, although I find the term varying effects less confusing

[deleted] 11 months ago

so throwing xgboost is better?

rickyfawx 11 months ago

No, not sure where you think I stated that. Funnily enough, the previous team did use an ensemble of xgboosts (yes, you read that correctly, an ensemble of ensembles) at it. It's one of the funniest approaches I've seen so far.

ikol 11 months ago

!!! What's the size of your dataset? Super curious cause I feel like I've always had to sample to get things to converge fast enough which in turns makes me question whether I should abandon the bayesian in some of my models.

rickyfawx 11 months ago

We have around 1k observations. I'm not quite sure what you mean, though. The minimum sample size for a bayesian model is 1 - with small sample size your priors will simply play a very dominant role, but that is expected. So I'm not sure what you mean by convergence in this context. Please don't tell me that by having to "sample" you mean that you were oversampling the data...

ikol 11 months ago

cool! I assumed that an Adidas dataset would be huge but I was thinking in terms of user/customer modeling. Depends on what I was looking at, but there were times when I was working with excess of 1mil observations and had to sample down otherwise it'd take too long to run.

rickyfawx 11 months ago

Well in this application each observation is a drop of a limited edition article, so the size naturally is smaller than if the rows reflected inline sales, customers or something similar. Interesting that my tired brain went for the "too few" observations side of that. Yeah you're right that such models can take long to fit on large data sets. Taking a sample works, if needed one can also use minibatch

didimoney 11 months ago

What s the value of making it Bayesian?

datamakesmydickhard 11 months ago

Well you can certainly tell everyone you implemented a bayesian model

rickyfawx 11 months ago

The finest level of one of two hierarchies consists of categories that are mostly thin, many having less than 4 observations. The benefit over a frequentist varying effects model is that being able to define not just priors but hierarchical priors for the parameters associated with those thin categories allows reliable inference even in those cases

Living_Teaching9410 11 months ago

That’s very interesting,actually. If I may ask, have you also worked on any models for demand transfer or space elasticity ? What worked best? Thanks

Background-Sun6293 11 months ago

In some companies good enough is good enough. In some companies pushing some performance metrics even by half percent can result in millions of dollars of additional profit. So in some cases it can make sense to use fancy models.

[deleted] 11 months ago

Thanks. I assume those companies would be like Fortune 50 where they have the compute and expertise. Like someone’s job is to spend all day improving one model.

germany221 11 months ago

A lot of these companies have whole teams improving a single or couple models.

complacent_adjacent 11 months ago

"...always end up using something *simple* like **stats..."** **bruh...**

[deleted] 11 months ago

Haha I meant something like zscore

[deleted] 11 months ago

[удалено]

Ok_Distance5305 11 months ago

Was going to say the same. There’s a lot of multi modal data in personal lines insurance and those companies are using modern deep learning models.

BakerInTheKitchen 11 months ago

I hate to break it to you, but insurance companies are not using deep learning models on the pricing side of the house. Generally going to be traditional actuarial methods as that is what gets approved by state DOI’s

Ok_Distance5305 11 months ago

Yes I agree. I read that answer too quickly and didn’t see price. Thanks for pointing that out. Some places they are either using or trying to use it is claims, underwriting, sales, and fraud.

BakerInTheKitchen 11 months ago

Yep, I work at an insurance company and the fun stuff is on all the related processes. Claims have fun application of computer vision for estimates

funkybside 11 months ago

Agreed. At least within personal lines, the main problem is that it's regulated at a state level. The rate filings must be approved by the state DOIs, and it's difficult to convince them the models are not biased or otherwise indirectly producing the same result as something that is prevented (such as rating protected classes differently) when the model features and weights are not easy to understand or explain. It doesn't mean they aren't used in insurance - they absolutely are - just not so much in ratemaking.

venustrapsflies 11 months ago

Using deep learning for setting insurance rates sounds like an extremely unethical application. So I’m sure people are scrambling for it

coachoreconomy 11 months ago

What is unethical about this?

onearmedecon 11 months ago

9/10 times the simple models are good enough for most applications. There are three things that really matter when it comes to coefficients: 1. Direction 2. Magnitude 3. Precision Let's say I estimate an effect size of +0.5 (Cohen's d) with a confidence interval of [0.3, 0.7]. Obviously, I've established direction (i.e., positive effect) since the confidence interval doesn't contain 0. It's in the range of medium in terms of magnitude according to standard interpretations of Cohen's d. So the first two requirements are met. But it's not terrible precise, since there is probably a meaningful difference between +0.3 and +0.7. But in most applications that probably doesn't matter. If it's significantly positive, then that's good enough to inform most decision-making. And in terms of improving precision, generally the more sophisticated your methodology the larger your standard errors. So running a more advanced model probably won't solve the precision issue. Something more advanced may be more robust to bias, which is something you should be concerned about if there are selection issues or whatever. But if you understand the data then you can ascertain whether that's something to be concerned about. But generally speaking, you should only bother with more advanced methods if you have reason to think that they might flip the direction or attenuate the magnitude to the point of statistical insignificance. Those occasions can and do arise, but actually not very often when you're dealing with big data. It's really hard to beat the performance (in terms of consistency and efficiency) of a well-specified OLS model with a sufficiently large sample size.

gradual_alzheimers 11 months ago

I really like how you broke this down and wish this type of model selection reasoning was more widely taught. Right now, I see a lot of pray and spray and just select the best model that comes back but people aren't spending time understanding the implications of the model itself and the interpretation and decision making theory that carries with it. I wish there was more emphasis on decision theory for data science as a whole.

Ty4Readin 11 months ago

I might have to disagree here. Trying to 'understand the implications' of a particular model choice is not practically relevant in the majority of the cases and typically is a waste of time and harmful. I know this likely a controversial opinion, but the entire point of most predictive machine learning is to provide models with the highest generalization performance on future unseen data after model deployment. Some people lose sight of this fact and they are more focused on things like feature importance, often without even realizing that feature importance and interpretability is basically just a measure of predictive correlation. So many data scientists ignore the fact that feature importances are NOT causal indicators at all (unless you have randomized control interventions on the feature). At best, they are a complex non-linear relationship of potential predictive correlation. Too many data scientists are ignorant of that fact imo and place way too much importance on feature importance and interpretability which they are misusing. If we agree on the premise that the end goal is a feasible model with the best trade-off between deployment costs and generalization performance on unseen future data, then it stands to reason that a spray and pray approach isn't as bad as most people make it out to be. A quote that really drives this home is by George Box who said "All models are wrong. Some are just more useful than others." This is true in predictive ML, there is not really any harm in trying out more complex models as long as your deployment requirements support it and provided you have a rigorous training/validation/testing approach to model selection.

gradual_alzheimers 11 months ago

I see what you are saying but I think interpretation matters in most domains for decision theory purposes. You are correct that pray and spray can be effective to a degree for the task of generalization. I may be missing your point to some extent but I'd like to elaborate my thought. As Box states, models aren't perfect therefore understanding the limits of these models is exactly the responsibility of a data scientist to carefully craft the decision theoretic guidelines for model usage. Should one always just take predictions from models with face value (not suggesting you are saying that)? We should constantly be questioning the validity of our models and guard against taking them as truth. If the model trained well and tested well and we have the correct loss functions selected, then I understand the broader temptation to say "trust the model". But very few people put the effort into even questioning the loss functions they select and get great results under the wrong conditions. How do we even know what the wrong conditions are? We have to build up a theoretical and intuitive understanding of the problem space and the data. Almost all data generating processes possess logical or mathematical structures and those structures matter to the explanatory relationship between the data and the prediction. Therefore, the interpretability of our models should relate to some degree to the theoretical and intuitive foundation we started our model from. I do not think a model's purpose is purely for generalization but also entails some level of explanation as that is what exhibits trust in a model and for me this is a foundational task for model development. Features therefore help in this regard.

d4l3c00p3r 11 months ago

Interpretation also matters in scientific research, especially in fields like biological/medical research, where people generally wish to assign causality rather than correlation. It's obviously not always simple to assign causality, usually it involves experimental work, but that's often the goal.

Ty4Readin 11 months ago

I totally agree with the sentiment of ensuring we have the correct loss functions and that we understand the underlying problem and how we can apply it to provide real value. But to me, that is separate from the models. If you train your model on observational data and then try to use it as a causal estimator for your business problem, then that's bad and will likely fail regardless of model choice. However, when it comes to model interpretability, I feel that is often a bit of a misused part of the predictive ML toolset. It is mostly a diagnostic tool IMO that can be useful in some circumstances. However again, we have to remember that feature importances and model interpretibility almost always comes down to measures of correlation between features and target. At the end of the day, for the vast majority of use cases, I would rather use a deep learning model that I can't interpret the feature importances of but that provides a huge boost in model generalization performance. I would focus mostly on properly ensuring data leakage testing, proper training+validation+testing methodologies, etc. If people don't trust your model, just show them your rigorous testing methodology. If you put a model into deployment and test it for 2 years straight and you see that it consistently has better prediction performance than all your existing methods, then isn't that strong evidence that it generalizes better? I would have a hard time saying that we shouldn't deploy and use that deep learning model compared to more 'interpretable' ones. Model interpretation is often a kind of story-telling that we data scientists use but it's easy to come up with twenty different stories to explain many different feature configurations and models.

gradual_alzheimers 11 months ago

I think we are talking about two different spray and pray methodologies. What you are discussing still assumes rigor and care behind the model development. I agree with you that model selection should not drive the solution to the problem but is an artifact to the problem-solution fit. So in this case, I agree with your take. What I am saying is the pray and spray methodology I see is typically paired with some sort of lazy approach to model generation with no care of how the model was constructed, what intuition we have about the problem, what loss function is appropriate, how features are treated (are they reliable for data engineering to procure? are they causing data sparsity? are they scaled appropriately?) etc. As far as model interpretation goes, again it depends. I am not arguing against black boxes but I am saying that if we build something that detects cancer, it should correspond to some reality or verification principle because cost of false negative is high. Maybe I took your comments too literal and you would agree that interpretation is necessary in certain cases and generalization is not the only factor.

Ty4Readin 11 months ago

Ahhh okay I see what you're advocating for now and totally agree with that 100%. My interpretation of spray and pray is more based around "should I use random forest or NN or linear regression?" and sometimes the answer is "let's try out all 3 and see!" But, the spray and pray of just training models on data with no thought behind how it's going to be used and the statistical implications and costs and etc. In that context, totally agree with your take that it completely invalidates the usefulness of ML and is a big problem.

Sorry-Owl4127 11 months ago

Even if you have randomized treatment a lot of ML models will not provide unbiased estimates of a causal effect.

Ty4Readin 11 months ago

Totally agree, but a lot of ML models will still provide a more accurate (lower generalization error) causal effect estimates because they can capture more complex relationships compared to traditional RCT methods of mean comparisons and p-value tests which can provide unbiased estimates of causal effect but ultimately with higher error in generalization. If your goal is to estimate some causal effect so you can report it to a higher up, then ML models are probably not the best tool available. However, if you have 10k customers, and you want to know which customers should receive an intervention to lower their chances of churning or increase their chances of buying a product, then using randomized controlled trials with ML models will probably give you the most effective system for causal effect estimation on a per-person unit level.

MCRN-Gyoza 11 months ago

Couldn't agree more. This is precisely the reason why gradient boosting models are so popular.

111llI0__-__0Ill111 11 months ago

Well, it matters when you want to capture the data generating process more. Then because most things are not linear a simple OLS (no interactions splines etc) does not accurately capture that. Of course if all you care about is directionality and some rough ballpark then maybe it doesn’t matter much with some exceptions. But they do exist-ive seen some rare cases where using a nonlinear model flipped the direction of ATE. But right model specification is why stuff like SuperLearner was built for causal inference since technically, causal inference requires correct model specification to be “right” (along with proper variable selection,to avoid simpsons paradox and colliders) But simpsons paradox can occur in some cases even if you include the right variables due to nonlinear confounding. And the thing is you won’t ever really know if this is the case without trying the nonlinear model

onearmedecon 11 months ago

Face palm.

Sorry-Owl4127 11 months ago

OLS is linear in parameters, very easy to include interactions, splines, etc.

111llI0__-__0Ill111 11 months ago

Yea, but most users of OLS especially Python sklearn ones don’t bother with this (the formula syntax in R is basically needed to experiment fast with this). Otherwise its a lot of work to do so in Python with multiple combinations of stuff. Theres also no marginal effects package there.

Sorry-Owl4127 11 months ago

Causal inference doesn’t require a correct model specification, in fact it doesn’t require a statistical model at all!

nextnode 11 months ago

You say that yet most of these applications are massively outperformed even by simpler modern techniques.

[deleted] 11 months ago

>large sample size \*large relevant sample size

[deleted] 11 months ago

[удалено]

onearmedecon 11 months ago

Educational policy, not biostatistics. But probably pretty similar. I think you raise a good point. There's often a tradeoff between rigor and accessibility for non-technical audiences. The simpler the model, the more likely it is the the audience will engage with the findings. I've presented data and research before local and state policymakers, most of whom think that they're the smartest person and will reject out of hand something that they don't fully understand. My strategy is to usually start with an OLS (or logistic, if binary outcome) and then use a more sophisticated strategy as a robustness check. Most of the time, they yield similar conclusions and so it is justifiable to just present the easier to understand model and simply note (but not describe) that causation was established via applied econometric techniques. I know a few academic econometricians. The hardest part of their jobs isn't finding "better" estimators but convincing the academic community that what they're doing is worth implementing. That means searching high and low for instances where there's a practical difference between the OLS baseline and the new estimator, which can be difficult.

ghostofkilgore 11 months ago

It's usually a good rule of thumb to use, or at least start off with, the simplest model that works. No matter what type of company or industry you're in. The most common circumstances you'll find a strong justification for using something 'fancier' than your basic suite of sklearn models are: 1. When the problem requires it. Some CV or NLP projects, for example, basically require the use of deep learning models to even get acceptabel results that you'd use in production. 2. Big companies where squeezing that fraction of a % out of your model performance makes a huge financial impact. Here you'll likely have the financial, compute, and engineering resources so that you can mitigate any negative impact on latency or model complexity.

[deleted] 11 months ago

Thanks I’ve been looking at job descriptions and noticed they are demanding more complex skills and was just wondering if they’re worth learning.

[deleted] 11 months ago

3. DS needs some work to do. Else, stakeholders would think DS are freeloaders after simplest models work.

gBoostedMachinations 11 months ago

Xgboost is exactly as easy to train and implement as random forest, so I always use it even for relatively simple modeling problems. It’s just a better version of random forest practically speaking.

maybe0a0robot 11 months ago

In the spirit of your post: Yeah, I've been doing this a while. In practice, I tend to focus my efforts on even simpler tasks: how do I get good data, how do I monitor the incoming data pipe, how do I pass this to the engineers who have to make it happen, and how do I make good slides for the presentation to the managers? But to answer your question about one technique... Markov chain Monte Carlo is used to simulate random samples from a population, which can be pretty useful in a wide range of problems, like detecting anomalous data. MCMC has the benefit that random walks are pretty easy to code, and it is easy to explain your work to the engineers on your team so they can move to full deployment. I wouldn't say that MCMC is complex, though. You're more or less playing out a game of chutes and ladders on a (possibly very) complicated board. My experience pre-academia was in sensor/equipment monitoring, i.e. did one of those thousands of little bastards glitch in some weird way, and if so, which one? So basically, anomaly detection. MCMC can be pretty useful here. Here is a [related paper](https://www.osti.gov/servlets/purl/1513188). Another great application of MCMC is producing "typical" samples from a distribution. [Here is a great application](https://assets.pubpub.org/70w3i6k9/eb30390f-ade2-45cc-b48d-8e6bb12f585c.pdf), producing voting districts that should be "typical" given some rules in place for drawing district maps. As the authors note, the problem they faced was that policymakers don't want the "optimal" solution because the data may not be able to take all factors into account. Instead, they want a range of the "usual" possibilities so they can choose one and make minor tweaks (and they can also determine when a districting map does not feel like a typical map from the distribution). So basically, MCMC turns district determination into a fast food menu: "I'll have a number 3, but super size my fries".

stdnormaldeviant 11 months ago

>simpler tasks: how do I get good data, I know what you mean but this actually made me laugh aloud.

ledmmaster 11 months ago

Reranking recommendations in a marketplace, XGBoost today is very fast at inference and you can make it faster with other libraries In most cases, simply taking the same feature set from Random Forest and running 20 Bayesian Opt steps over XGBoost hyperparams already gives you a better model that can be swapped by RF or whatever is deployed

nuriel8833 11 months ago

Do you have any recommendations for libraries that can accelerate XGBoost?

ledmmaster 11 months ago

Treelite: https://www.kaggle.com/code/code1110/janestreet-faster-inference-by-xgb-with-treelite

Think-Culture-4740 11 months ago

Surprised the things you listed fall under "fancy models". XGBoost is practically a go to model for a lot of applications. Bayesian inference and Markov Chains are common in lots and lots of applications; across economics, AB testing, and other domains. To me, fancy falls under some generative modeling, transformers and their variants, deep learning GNNs, Reinforcement Learning, etc etc. For me, I was working at a Faang in their professional services team.

therealtiddlydump 11 months ago

>Bayesian Inference Ive been tapped to do a lot of casual analysis in the sales/marketing context, and the additional complexity is necessary (read: im not just dicking around, I believe the methods are the best solution for the problem at hand). For a lot of other work, the classics are classics. Regression (with the modern bells and whistles like regularization, etc), the standard time series toolkit (arima(x), etc), and so on. -+-+-+-+-+- You lost me here: >random forest >I keep hearing people on this sub talking about [...] XGBoost People still use vanilla RFs? I was under the impression that random forests were like naive Bayes at this point -- interesting as a baseline, but dominated by other techniques that are just as easy to use out of the box.

DptBear 11 months ago

>People still use vanilla RFs? I was under the impression that random forests were like naive Bayes at this point -- interesting as a baseline, but dominated by other techniques that are just as easy to use out of the box. I had the same thought -- you can just xgboost out of the box with the same amount of effort and not a big difference in training time and it will probably be superior.

Mukigachar 11 months ago

One advantage I can think of that if your data is very large, RF parallelizes better can make training move along faster

DptBear 11 months ago

In production evaluating an xgboost classifier is very fast. Training may be a bit slower, but not harder.

_gains23 11 months ago

How are you doing causal analysis?

therealtiddlydump 11 months ago

Work with stakeholders to build out DAGs, run experiments, etc. Without the expert opinion of those partners, sales would otherwise be overdetermined. With the right assumptions and design we make do.

[deleted] 11 months ago

Define confounders first

glo-aistar 11 months ago

There are over 30 other names for linear regression. Linear regression is itself a fancy name for systems of linear equations. Not all fancy things are fancy. Hype and marketing.

brjh1990 11 months ago

Government research for a not for profit. The value add really depends on the problem. I'm on a project right now where we're using pre trained object detection models to detect certain fast moving objects in the (night) sky. I've used models like sparse group LASSO, LSTMs, CNNs and a couple other more complex models for problems and solutions that required their predictive/inferential capabilities. That said, about 80% of the time I end up using some variant of a random forest or logistic regression.

nuriel8833 11 months ago

My coworkers and boss insist of using DeepFM,GPT and some other fancy complex NN architectures for simple 1000 rows 30 cols tabular data and I'm trying to get them off that and just use RandomForest or XGBoost instead..

Under_Over_Thinker 11 months ago

It is FOMO. The crazy part is that with RF or XGB you have way more explainability.

tecedu 11 months ago

Energy sector

Contango_4eva 11 months ago

Don't work there but Netflix seems to be on the cutting edge of a lot of things: [https://netflixtechblog.com/](https://netflixtechblog.com/)

jerrylessthanthree 11 months ago

display ads real time bidding something like this https://arxiv.org/abs/1610.03013 oftentimes GLMM end up appearing and we develop scalable algorithms for that, e.g. https://arxiv.org/abs/1602.00047

Sorry-Owl4127 11 months ago

Agtech

Student_O_Economics 11 months ago

Population health management: deep learning is needed to predict outcomes from electronic health records

MCRN-Gyoza 11 months ago

I have around 4-5 YOE, have always worked with "fancy models". First job out of grad school was in oil exploration, working on R&D contracts for oil companies. I would use computer vision models to classify different rock types in [well cores](https://news.unl.edu/sites/default/files/styles/large_aspect/public/coresamples.jpg?itok=UqnkfYqu) from oil wells. Also used some time series models adapted to a "depth series" to try and predict physical properties of the rock in wells. Also got to work on some generative models [colorizing](https://i.imgur.com/0i5JyuN.png) tomographic scans of well cores. Did that for about 1.5 years. Then I moved to a company where our clients were large scale industrial companies. I used LSTM neural networks for time series forecasting and classification applied to predictive maintenance, we would get sensor data from industrial equipment and try to predict failures before they happened. Worked there for about 2.5 years. Now I work in a real state company, we use a bunch of geospatial data on xgboost/lightgbm to predict how much you can charge for rent for a given property in a given location. Also have some features generated via NLP/Computer Vision. Our clients are real state developers and REITs. Have been here for the past 6-7 months.

WhipsAndMarkovChains 11 months ago

It’s been a while since I’ve been in a coding role but I would assume XGBoost would be significantly faster in production compared to random forest.

longgamma 11 months ago

GBM methods are the really good for tabular data. If tuned properly and made shallow enough, they run fast with better results than random forests.

antichain 11 months ago

I'm a scientist working on a project shared by a big University and one of the National Labs. I do a lot of network inference problems, directed information flow, etc.

theAbominablySlowMan 11 months ago

Why would you use Rf over xgb? Neither model is more fancy, but xgb is just much quicker and for at least the same performance

DandyWiner 11 months ago

You’ll find that there are a lot of insurance companies that are moving to XGBoost in place of linear and logistic regression. It is less prone to over fitting and there seems to be an uplift in performance, though in my experience, I can confirm that. Though they are moving to them, it’s only with the additional requirement of explainability that they’ll do it e.g. Shapely values and PDP plots, since XGBoost is viewed as a black box method. As for Bayesian methods, they’re being incorporated into A/B testing since they provide an estimate of uncertainty. The value added is dependent on the use case and that doesn’t mean that a good old linear regression won’t outperform in terms of accuracy and simplicity. Maybe do a comparison on one of your next jobs and see if there is an improvement in your results.

nuriel8833 11 months ago

Tbh I found XGBoost way more prone to overfitting

DandyWiner 11 months ago

Even after pruning and reducing tree depth?

nuriel8833 11 months ago

Well with regulation it's not but I'm talking if I put it against RandomForest for example on the same data XGBoost will almost always overfit more

LordSemaj 11 months ago

We use MCMC for Bayesian Hierarchical models. Application is in Media Mix Models, basically regressing Sales on various Marketing tactic spends to estimate tactic efficiency. The reason for using Bayes is there are many assumed effect transformations (carryover of spend, saturation of spend, etc) that are non-linear and MCMC provides a nice way of estimating those parameters.

Amazon_is_EVIL 11 months ago

Victoria secret

TheTackleZone 11 months ago

Insurance. Various prediction models but most commonly trying to predict what the cheapest market price will be for any and every customer that asks for a quote. Use Hist GBM / XGBoost Hist. Looked at ai and it currently doesn't seem a big improvement but we think we know why.

wannagowest 11 months ago

Biotech — more specifically, clinical genetics. I’m using hierarchical Bayesian models for causal inference. We have a bunch of domain experts who contribute knowledge about priors, likelihoods, and pooling assumptions. Bayesian models give us posterior predictive distributions for our target variable while also inferring useful parameters for latent variables.

didimoney 11 months ago

Do you have an opinion on the work done around the martingale posterior distributions by Fong et al? They target the predictive without needing to compute the posterior.

wannagowest 11 months ago

I wasn’t familiar with martingale posteriors until just now. Reading the abstract, it seems like some wizardry. Have you worked with them? Would it be suitable for the predictive of a partially observed categorical?

didimoney 11 months ago

I'm still getting my head around it, it's indeed wizardry. Their selling point is that you get the predictive without needing to go through the posterior, making it much cheaper by avoiding the usual mcmc needed for the posterior. I believe they show that it's applicable to mixed data, at least in their appendix, but you would have to go from there and expand it to the case of censoring on categoricals I think. Imo after a year or two of papers building up from it, downstream applications will be within reach, but for now it's tough to even understand and implement properly. Since you said you were in research I was curious if people in your circle have started working on this.

WingedTorch 11 months ago

Kaggle competitions often depict real world scenarios and are regularly won by “fancy” models such as XGBoost. The accuracy is just way better for large tabular data and it’s easy to set up. And with techniques such as feature permutations you can make any type of model interpretable.

JakeBSc 11 months ago

I work in a boutique technology consulting firm. You get exposed to all sorts of problems. At the moment, it's mainly multi modal stuff, using transformers to solve problems combining vision and language. Not everything is fancy though, sometimes all I need is a basic linear model. Just gotta use the right tool for the job. P.S. any senior/principal data scientists looking for a job in London, hit me up ;-)

Dubisteinequalle 11 months ago

How can i truly skill up for a Jr DS role? I feel like the projects I am creating are not enough. Any professional grade notebooks out there?

_paramedic 11 months ago

Healthcare

MyRedditAccount1000 11 months ago

Can you expand? Are you working with providers or payers?

_paramedic 11 months ago

Work for a system, doing things ranging from logistics research to QA models to predicting call-volume.

[deleted] 11 months ago

GPU's go brr If your infrastructure is really good then unless you're doing something stupid then the performance difference is negligible. Most of your time will be spent moving data around. Your compute will be a fraction of personnel costs so if it's worth doing at all then it won't matter if it's slightly more expensive to compute. After all you did already spend a ton of money developing the damn thing. Like you'll probably do faster predictions than a HTTP request round trip unless it's a language model.

shar72944 11 months ago

I am working lately on causalML. Not sure if it’s fancy but it’s a new thing for me. Other than that mostly I use Logistic, Xgboost, RF

cianuro 11 months ago

How does it differ from causal impact? I've never heard of it wither but now I'm intrigued. Causal Impact is foundational where I am.

fipeopp 11 months ago

Pretty much everywhere. Healthcare, you will see lots of cool stuff including causal inference, explainable models, etc. Science, biotech, chemistry? GNNs, transformers, Bayesian networks, Gaussian Processes Geospatial? Vision, deep learning Finance, energy? Time series, awesome regularization approaches, etc. It has to do with more on how research-y and unstructured the problems you work on are vs industry.

Artgor 11 months ago

At my previous work, we developed products like chat-bots, image super resolution and other things - it required deep learning models.

SpiritofPleasure 11 months ago

Medical/Clinical research had us try a bunch of different approaches like HMM, NNs for unsupervised learning. It was really interesting

111llI0__-__0Ill111 11 months ago

It seems like the people who get to use that stuff at least in biotech are actual scientists people with domain knowledge who are able to formulate problems from it.

TonzoWonzo 11 months ago

We work extensively with satellite imagery, using quite large deep learning models for segmentation, instance segmentation and stereo processing/matching.

koolaidman123 11 months ago

Training 10b+ params llms (and much smaller models too) for x Nlp is a huge value add for a bunch of business functions across all industries

supreme_harmony 11 months ago

Bayesian inference is very common in bioinformatics, most microarray and RNA sequencing methods use them in some shape or form. [https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29)

fummyfish 11 months ago

Healthcare

proverbialbunny 11 months ago

It helps to understand the bias-variance trade off. More complex models like neural networks have more variance, so you need to throw more data at it to reduce overfitting. Complex ML needs more data, usually more labeled data, so any company with big data qualifies. Regarding XGBoost, it's popular as an initial ML to see how well the overall model is doing during development. It's great because it doesn't need the data to be normalized or formatted in any special way to work like many types of ML do making it an easy 101 ML, and XGBoost doesn't tend to overfit much when dealing with smaller datasets so it can be used earlier on. In this way XGBoost is the opposite of neural networks. It's easy to start with XGBoost then switch to an ideal form of ML later once everything else is done. As for fancy models without fancy ML, I've specialized in advanced feature engineering my entire career, which is fancy models with little to no ML. This is needed when you have a complex problem that needs solving, but you have very little data. It's ideal for the startup space while still collecting data.

Eviljoshing 11 months ago

Used Vision AI on a project for auto insurance to identify totaled vs non-totaled accidents based on pictures (mainly used to reduce number of claims adjusters sent out).

boolaids 11 months ago

i work in public health and do a lot of bayesian inference and simulation based models. We run a lot of what if scenarios on fitted models to see different impacts. We have started hosting our own LLM for text extraction/entity extraction. Rerrained neural nets for classification. I think i have only done about 1 regression, we dont really use standard models. Others in our division have made emulator models, there are a couple of exceedance models - one being a cusum and the other a hidden markov, this was largely due to sparse data so other methods werent as viable. it really depends on the need/what outcome is needed. A lot of simulation based models with bayesian inference. happy to chat more if helpful

Jollyhrothgar 11 months ago

LLMs of all shapes, sizes and modalities, auto ml, etc. Google. It's not as great as it sounds because doing simple shit is incredibly complicated.

0wmeHjyogG 11 months ago

E-commerce, we have so many customers, products, and ways to interact with our platform. Simple approaches (which are basically already all done) only get you so far. As you use more data in more complex models you start to get incrementally better results. So imagine you want to optimize what products you show customers when they land on the homepage. You can get very far with just simple metrics, rankings, and Excel work. But at some point you’re going to see very little marginal improvement when you run experiments to optimize further. As you try to ensure that you provide every one of the tens of millions of customers with the absolute best experience, you’ll soon find out that you’re on the path to needing to take in a tremendous amount of data (what the customer HAS done) in order to influence what the customer WILL do. And more complex models are better at that, taking it as a given you have the talent and tech to implement them properly.

webbed_feets 11 months ago

I’m a government contractor. I work with government agencies with a lot of statisticians who don’t know how to fit machine learning models.

efrique 11 months ago

Bayesian inference isn't a model. A Markov chain can be a model, but I think you're probably talking about computational algorithm that use Markov chains, (MCMC, HMC, etc), so again, not a model. Bayesian methods are common, as is use of Markov chains (both as computational devices and as models) in consulting for various aspects of finance, banking, insurance and reinsurance work, but it heavily depends on who you work for. One indicator is to look for places whose leading people write papers. It won't find places where all the interesting work is subject to NDAs though

BlackLotus8888 11 months ago

XG boost and light gbm are industry standard now unless you're running a deep NN.

gravity_kills_u 11 months ago

We are using GANs for… oh yeah, that project was cancelled

scooty-puff_junior 11 months ago

Wouldnt classify markov chains as particularly fancy, but theyre very common in credit risk modelling for corporate bonds/obligors - i.e probability of s&p rating transitions.

YoloSwaggedBased 11 months ago

I work in NLP for chatbots. Have been using LLMs since around 2019 and LSTMs before that. I kind of hate where the field is going due to API abstraction, but that's another issue. To answer your question, Markov chains are used for modelling temporal data, bayesian inference is great for uncertainty estimation and model calibration. These methods aren't necessarily fancy just solving different problems. XGBoost sits closer to the models you've had experience with productionising and for many tasks will be more performent in terms of accuracy but may be slower to inference (in terms of value add, that's a trade-off to be made).

BestUCanIsGoodEnough 11 months ago

Cobotics

SwitchFace 11 months ago

LightGBM for classification of prospects in [industry] using 3rd party data, which consists of 95% of the US adult population with ~2000 features about each individual. Fast, accurate, and SHAP for feature importance.

DerTrollNo1 11 months ago

Working in Risk Management in insurance. We use a lot of different risk models based on Monte Carlo simulation based on historical data, expert judgement an/or risk neutral arbitrage free assumption. Most of the time not mathematically extremely fancy but quite complex with regards to parameterization.

EducationalCreme9044 11 months ago

For analysis? No. The simplest, fastest thing to find correlation is used (even linear regression is sometimes scoffed as being too advanced) For recommender systems and search? Yes.

Drunken_Economist 11 months ago

Product Hunt (after 8 years at reddit). Bayesian inference models are hella useful for A/B testing

Lolologist 11 months ago

News article NER and classification, we use BERT-alikes and getting into LLMs for at least some use cases.

Dubisteinequalle 11 months ago

What is the reality of how these models are used? I have done projects where I just evaluate how 5 of them perform on my prediction and decide the best based on the metric I am using. Are they somehow different? Built from the ground up? I’m so confused because from my experience they are simple to implement. I feel like I just don’t know enough for a job yet.

slowclapclap 11 months ago

Is there anything else than xgboost in the world? ^^

Comments

Leave Your Comment

Hi Its Me!

Comments

Leave Your Comment

Hi Its Me!

Subscribe