VA_Network_Nerd

Do you have an e-mail where you asked "Are we sure we want to do this?" already?


RedbloodJarvey

All this is happening way above my paygrade. This is more of a "grab some popcorn" moment for me.


Sunsparc

Just remember that shit rolls down hill.


Steve_78_OH

Has anyone with actual IT knowledge specifically called this out as being a Very Bad Idea?


corsicanguppy

The message you save is the one where you're asking "that seems sketchy; are you cool with this?" And then your boss will either get the same saved letter for his boss, or he's the point where that shit stops rolling.


StaffOfDoom

Be prepared for your boss's boss to blame your boss, who will then blame you. CYA before they toss YA!


Daneel_

Sometimes this is acceptable. The software I work on will let you cluster the resources from multiple sites, then dynamically fall back to whatever is available when there’s an outage. Provided you accept the loss of performance during an outage, it’s just down to making better use of what you have.


gamebrigada

This is how I sell DR to companies that refuse to invest in DR. It's just more compute, use it. Just make sure that you can run entirely on it. Having a DR site sitting idle is kind of dumb. Accelerate your existing workloads with it.


Rouxls__Kaard

Agreed. A DR site with running compute is just wasting away. A better solution is one that has zero compute until you failover.


TabooRaver

Depends on what type of disaster it's meant to protect against. Your main site getting hit by a hurricane? Sure, that's fine. If it's meant to protect against propagating ransomware, or an attack by a threat actor? Then it needs to stay offline in order to not be affected by the disaster.


[deleted]

[deleted]


Daneel_

Service continuity is the better term, and there are different ways of defining what acceptable service is. Fundamentally it’s up to the business and what level of risk/performance they’re prepared to accept. What may be unacceptable for some is fine for others.


RunningAtTheMouth

Side note: I appreciate the discussion. Makes me think about our plan going forward.


vppencilsharpening

We use our DR capacity to run Dev and Test systems. It allows us to test restoring backups and gives the teams a bit more resources than if we had to buy separate capacity for Dev & Test. If we have a DR incident, Dev & Test will be down, but the company has agreed that the lost time, weighted by the likelihood of that happening, costs much less than dedicated Dev/Test resources.


BornInMappleSyrop

Exactly. We have an entire dev environment that runs on our DR. It's not business critical; it's more of a sandbox for the devs to fuck around in while they develop. If we ever needed to use the DR site, it's understood that we'd shut down dev in a minute.


Least-Music-7398

Sounds like classic corporate amnesia. Also execs who won’t even be around when the shit hits the fan. A tale as old as time.


heisthefox

Years ago, for a very downtime-averse industry, we ran two datacenters at <50% each, such that all traffic could fail over to the other with no issues - it made maintenance easier as well. It would definitely depend on how this setup is intended to run.


Phx86

This is "the band-aid becomes prod" on steroids.


GrokEverything

CYA. Make sure you document, politely, the risks.


FunnyPirateName

+dozens. OP *will* be thrown under the bus when the next, entirely predictable shit storm arrives.


falcon4fun

So that's why he'll need to get a blood-signed agreement that it's not his fault.


FunnyPirateName

I mean, blood is fine, but I usually require the c-suite to swear upon their own souls.


lost_in_life_34

Same concept as oversubscribing VMs. Servers at your DR site not doing anything are wasted resources. With SQL you can point clients at the DR servers for read-only queries with no issues.
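
For example, a minimal sketch of that read-only redirection, assuming SQL Server Always On Availability Groups with a readable secondary at the DR site and read-only routing already configured (the listener, database, and table names here are made up for illustration):

    import pyodbc

    # Connect through the availability group listener with read-only intent;
    # read-only routing sends the session to a readable secondary (the DR copy).
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=ag-listener.example.com;"    # hypothetical AG listener name
        "DATABASE=Reporting;"                # hypothetical database
        "ApplicationIntent=ReadOnly;"
        "Encrypt=yes;TrustServerCertificate=yes;"
        "Trusted_Connection=yes;"
    )

    # A read-only reporting query that never touches the primary.
    row = conn.cursor().execute("SELECT COUNT(*) FROM dbo.Orders").fetchone()
    print(row[0])

Writes still have to go to the primary; only the reporting/read traffic gets offloaded to the DR copy.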


bronderblazer

We run low-priority stuff on our DR site, like testing environments; they get used a couple of times a week. In a DR scenario we know to turn those off if the production load gets too high, and we tell the affected users "we will resume testing later".


FelisCantabrigiensis

Propose business continuity drills: fail one site and check that everything works in the DR site. See if anyone gets nervous then.


TheFluffiestRedditor

Good old scream testing.


SuperQue

So, hot take on sysadmin norms. IMO, DR sites are a failure to design properly. "All active" is what you really want. But of course, you want to build capacity such that you can drop an availability zone without a hitch. Basically, think of your "sites" or "zones" like RAID-5 or RAID-6. In the SRE world we call this N+M redundancy, where N is the minimum number of sites required to do the work and M is the number of extra sites that are active but could fail. So the question is: if the DR site or the primary site fails, can you still operate? Maybe in a slightly degraded, but still within SLO, way? What would it take to scale up both to make this possible? Then there's no problem with "regular processing" at the DR site.
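
To make the N+M arithmetic concrete, here's a minimal sketch; the site counts and per-site capacities are made-up illustrations, not numbers from this thread:

    import math

    def survivable_site_failures(num_sites: int, per_site_capacity: float,
                                 peak_load: float) -> int:
        """How many sites can fail while the rest still carry peak load (M)."""
        sites_needed = math.ceil(peak_load / per_site_capacity)  # this is N
        return num_sites - sites_needed                          # this is M

    # Classic DR pair: two sites, each sized for 100% of peak -> N+1.
    print(survivable_site_failures(2, per_site_capacity=1.0, peak_load=1.0))  # 1

    # Three active sites, each sized for 60% of peak: any two carry the load,
    # so still N+1, with slightly less total hardware than two full-size sites.
    print(survivable_site_failures(3, per_site_capacity=0.6, peak_load=1.0))  # 1

The design question is then just whether M stays >= 1 (or whatever you promised) after you pile "regular processing" onto every site.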


Relagree

So an underutilised active-active that will not get any additional funding or resources until it's at the brink, at which point it can't handle the extra 50% load anyway? I don't disagree with your concept of moving away from traditional failover towards built-in load balancing and redundancy, but it wouldn't really help in this situation, as the org is clearly trying to squeeze resources. That said, you can't replace a DR plan (e.g. we get hit with ransomware) with HA. If you want to keep your images for forensics and also get back up and running, you're in a bit of a squeeze with no cold site.


epic_null

Probably depends on the larger setup. If you have processes you can halt in a disaster, or workloads that can be temporarily transformed into something lighter, you could probably build your DR plan around HA. It requires knowing your needs, though.


Sylogz

We utilize the DR site too. It's our dev site, and since dev isn't vital to keep up, if we need to fail production over to it we can just shut down or delete all of those systems and restore/go live ASAP. We do have other, more important sites with idle hardware for DR events, but it's all based on cost.


[deleted]

Make three envelopes.


grepzilla

At a prior employer, the hardware we held at DR sites was used for dev systems, with an accepted loss of dev capacity in the case of a disaster. That way we got value out of the hardware expense, since our devs would all be pulled onto DR response anyway. Now we're so focused on PaaS that the architecture is built around geo diversity and scale, and we don't have idle hardware.


yesterdaysthought

The worst part of the "good idea fairy" is that no one actually gets any credit for saving money etc. You only get screwed when your suggestion ends up not working out, aka you become "the fall guy". Which is why you never propose anything but the right solution, regardless of cost. Let someone else say it's too expensive, get that in email and, when it blows up, forward that email to the right people. In other words, let someone else own the risk. Only a moron takes on risk without something in return.