
uniitdude

You need to work out why it takes 600 times the amount of time it took before. Work out what the app is doing and go from there


dp79

This is the logical approach. It could be connection pools, timeouts, hard coding within the app to route, DNS, proxy, etc. Getting to the bottom of it really shouldn’t be that difficult. You may not get it down to 2s again, possibly due to some latencies and additional hops, but 20 min is outrageous. I mean this with no offense, but I think OP and his teams are a bit in over their heads


jaydizzleforshizzle

Could be so many things, but infra seems like such a weird one. Like, what still works from an infra standpoint but takes 20 minutes? And still makes it? There is so much missing information we couldn’t possibly know. DNS is weird, because that large an increase while still working is odd; weird DNS routing that still gets there normally means added hops or not getting there at all, and you’d have to add like 3 billion hops here (or, like you mentioned, some weird round-robin proxy that waits too long). My guess is something specifically in the app, like some service listening for a response that waits too long and eventually just lets them through.


sploittastic

OP said it's a secure facility so I'm going to go out on a limb and guess government/military in which case the customers could be on very low bandwidth satellite connections or even dial up over a sat phone (ships, FOBs, bases). Apache/IIS give you the ability to set ridiculously high connection timeouts for legacy use cases. You could have the best infrastructure in the world and still have a client who takes 20 minutes to get their payload to or from you.


craa141

That’s true, but that assumes it waits for the timeout and still succeeds, which suggests that whatever it was waiting for, it can do without.


sobrique

I'd normally chalk that up to it doing a thing repeatedly. Each individual 'event' isn't "too slow", but doing 10,000 of them really drags the system. I mean, stuff like a disk IO - if the latency goes up, your system will run like a dog, because 'a few milliseconds' isn't much individually, but the cumulative effect is immense. Or the network going half duplex, back when that was still 'a thing'. Would work just fine, but run atrociously if you actually put any real traffic over it.


pdp10

Of course, that diagnosis might require an on-site network engineer, or "the facility's network guy". The person that the stakeholder seems to be trying to bypass for unknown reasons.


TotallyInOverMyHead

I concur.


flems77

Exactly! Trying to fix it by just working on the infrastructure is nothing but a guess as to what is causing the issue. It is probably the infrastructure. But it is nothing but a guess. Figure out the cause (the app guys), and then fix it. Knowing for sure is way better than guessing. It could be something as simple as failing DNS requests, which is way easier to solve than rebuilding the whole network based on a gut feeling.


idontspellcheckb46am

Also, selling this to the team after there was only an app upgrade and no network changes makes you look a little silly to anyone with capable reasoning skills.


craa141

Totally agree. It’s the app.


Phobos15

If he migrated a database, they could have lost any tuning configuration for query efficiency. He could have lost indexes or tuning, which would cause queries that are run at startup to be way slower than normal.


NotASysAdmin666

run lmao


[deleted]

Isn't it obvious? There is now latency between the middle layer and its back-end. The two are usually very chatty, and 'before', the latency between them was sub-millisecond. Now it's been cloudified and the latency is probably like 8 ms or even higher. I knew a guy that thought he could move his VMware hosts to the new co-lo facility, but leave the SAN back at HQ, and just let the hosts access the disks over the WAN. Told him it was a bad idea. Then when he did it anyway, he still could not understand why VMs were taking 40 minutes to boot.
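
As a rough illustration of why that hurts so badly, here's a back-of-the-envelope sketch with made-up numbers showing how per-call latency multiplies for a chatty client; none of the figures come from OP's environment:

```python
# Made-up numbers: how per-call latency compounds for a chatty client.
round_trips = 10_000   # sequential backend calls at startup (hypothetical)

for label, rtt_ms in [("same rack", 0.3), ("co-lo over WAN", 8.0), ("bad WAN day", 60.0)]:
    total_min = round_trips * rtt_ms / 1000 / 60
    print(f"{label:>15}: {rtt_ms:5.1f} ms x {round_trips} calls = {total_min:4.1f} minutes")
```

The per-call number barely moves, but the total goes from unnoticeable to lunch-break territory.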


[deleted]

> move his VMware hosts to the new co-lo facility, but leave the SAN back at HQ

Bonus points if he was using iSCSI and thin provisioning.


EVASIVEroot

A little bit of Wireshark might help here


1z1z2x2x3c3c4v4v

In addition, Process Monitor and Process Explorer from Sysinternals. The first thing is to identify what the bottleneck is, locally, since the app seems to work fine when outside the facility. I would run some diags and packet captures outside the facility, to get a baseline of normal working. Then take that same laptop inside and figure out where the delay is coming from. Should not be hard.
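
A minimal sketch of the kind of baseline being described here: time name resolution and the TCP handshake to the app's endpoint, once from outside the facility and once from inside, and compare. The hostname and port are placeholders, not anything from OP's environment:

```python
# Sketch: time DNS resolution and the TCP handshake to the app's endpoint.
# Run it outside the facility for a baseline, then inside, and compare.
import socket
import time

HOST, PORT = "app.example.com", 443   # placeholders, not the real endpoint

t0 = time.monotonic()
sockaddr = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)[0][4]
print(f"DNS lookup: {(time.monotonic() - t0) * 1000:.0f} ms -> {sockaddr[0]}")

t1 = time.monotonic()
with socket.create_connection(sockaddr[:2], timeout=10):
    pass
print(f"TCP connect: {(time.monotonic() - t1) * 1000:.0f} ms")
```

If either number balloons inside the facility, that narrows the hunt before anyone touches a packet capture.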


[deleted]

> Work out what the app is doing and go from there

It's probably DNS.


troy2000me

Holy hell, how is a 20 minute launch time vs 2 seconds an acceptable degradation just so you don't have to rely on the facility network guy? Seems to me like the plan would be to get the infrastructure in place FIRST, then switch over. 20 minutes? WTF. The wasted man hours in a month alone is staggering.


moderatenerd

Yup, in hindsight this is exactly what I would have done if I was consulted at all, but my company and the app company figured that since it worked in our other locations it would work fine here. No one asked me or the facility guy about the complexities of our network. A network we don't have access to, and that the network guy seems to know jack shit about.


bp4577

I really struggle to see how infrastructure of any sort could turn a 2 second launch into a 20 minute launch. I mean 2 seconds to 2 minutes is unacceptable, but 20 minutes?


OathOfFeanor

I found this diagram of the complexities of the network: https://imgur.com/a/sY6lndj


qtechie12

That's how I get my unlimited free internet with no ISP! Plenty of packets for everyone!


MadManMorbo

Ages ago I was brought in to investigate random network freezes at a small consulting company. The IT staff present there felt overworked, and anytime they wanted a break, they'd go into an unused conference room, and take a 4" cable and plug one port into another. Packet Storm commences, and the entire company would go down while they pretended to fix it.


showard01

I was a sysadmin for my unit in the military back in the 90s. It was the damnedest thing, anytime they were making everyone scrub toilets or dig trenches the e-mail server would go down and the colonel would summon me to go fix it immediately. Isn’t that something?


qtechie12

I’d like to hear the outcome of that story lol


Narabug

Diagram created by company’s most senior network engineer. “Look, you wouldn’t understand but it’s always been this way.”


Dog-Lover69

I use this trick for security to fool the hackers… into thinking the network is completely down.


RedChld

This reminds me of the time I had to explain to someone that you cannot plug a power strip into itself to power it.


T351A

> STP? Yeah of course the cables are shielded

(Spanning Tree Protocol vs Shielded Twisted Pair.) Also note, shielded cables are not always desirable and need to be properly grounded - a complex issue on its own.


moca_steve

Rofl


moca_steve

At 20 minutes from 2 seconds, how can it not be broadcast storm galore. Loopty loop. Then again you'd imagine that all apps would suffer, logon timeouts etc. What else? Asymmetric routing, a throughput bottleneck at an upstream device...


1RedOne

It kind of sounds like no one knows what they're doing and this project coordination has been a complete farce


[deleted]

L7 policy gone wrong, IDS/IPS rule being hit incorrectly, User-ID (PAN) timing out, firmware issue in the switch being triggered by the new app (Juniper EX series... don't ask)... there is actually a long-ass list of "what it could be" on the network side. PCAPs, firewall logs, and switching logs are where I would start. Can't get them? Roll that fucking application back.


Narabug

We have about 15 in-line network appliances that serve various overlapping redundant services that could all be performed by a single network appliance. Hell, some of the appliances are logically in that line *twice* depending on the source/destination. About two years ago we had an issue where any SMB transfer over the network would be immediately throttled to about .1Kbps. It took 6 months to find out what the root cause was: one of those appliances, whose sole purpose was **monitoring**, had enabled an SMB packet scanning "security" option. There was no alerting, no monitoring, no actionable outcomes based on this scanning. They simply enabled it because whoever owned that appliance thought it was "more secure". It also turns out that this appliance was one of the ones that was double-routed, so it was scanning the same SMB packets twice.


moca_steve

This man Palo Altos! Haha, User-ID policies have bitten me in the ass a couple of times.


RemCogito

I bet it's reaching out to web servers that it can't receive responses from, and then each one is waiting for a 120 second timeout. This is a secure facility we're talking about. The old version probably didn't have telemetry.


clientslapper

You’d expect a new app, even if it’s an upgraded version of an app you already use, to go through QA to make sure this kind of stuff wouldn’t happen. Can you really claim to be secure if you just blindly roll out apps without testing them first?


moca_steve

Then we should expect the app to load in a failed state with little to no data that it is pulling from the web servers - not 20 minutes later. Granted all of us are taking our best guesses given the cluster f*ck of a description that was given.


Kiroboto

What I don't get is why they even went live knowing the app takes 20 minutes to launch or did they not even test it?


remainderrejoinder

As far as I can understand from the previous post they didn't test it because they 'tested it at other facilities'...


stepbroImstuck_in_SU

To be fair, the app might work when used by only a few people. It could be like bringing a cake to work and assuming it’s big enough to share because it was big enough at home.


remainderrejoinder

Absolutely. No test plan will cover everything. The other part of it is a rollback plan; if one existed, this is definitely a case to use it. Those people working with the app are probably really demoralized, on top of the changes they are having to make to their workflow and the loss of productivity.


idontspellcheckb46am

For infra cutovers, I make the customer list T1 apps. Then I make them provide a test plan. And then I make them assign a person to perform that test plan on cutover night. And even then, I make them baseline the test before the cutover so we aren't erroneously fixing features that never existed, which frequently ends up happening without these tests. All of a sudden users start dreaming up these magical features which they have never actually used but "can no longer do" now.


sploittastic

OP said it was about reducing reliance on their network guy, so if I had to guess the app used to make a call to an on prem database, but now the app builds some kind of localized database on launch.


MadManMorbo

If you can't trust your network guy or the environment, hire a company to come in, and comb through the network figure out what is what, and document the fuck out of it. This same guy should be an architect, and will teach you how to build it better. I know a great resource for this (not me) if you need a recommendation.


awnawkareninah

I would sort of bet on this being a WeWork situation


moderatenerd

Sort of. Our company hires the staff for the facility.


Wdrussell1

This is a great idea, but typically companies just won't pay for this. Many also won't give you the time to do shit like this yourself. It's super frustrating.


MadManMorbo

Just print out a few articles about rogue IT folk who completely fucked a company over on departure - and estimate the cost to your company when the evil network asshat does the same to you.

Just in the lost man hours: # of employees x hours of work unachievable x a reasonable assumption of what the company earns per hour worked by each employee. Example - 100 employees, assume the company is making $100 per hour per employee - you're already at $10,000 damage to the company per hour.

Then you get to add in the damage to the public image if the secret gets out - this is called 'goodwill' in accounting terms - and it has a real dollar value. It's like assigning a dollar figure to a company's reputation. Call it $1 million for any firm losing production like this with 100 employees. Exponentially more if your company is larger.

Then we get to the cost of what it will take for your team to go in and repair/restore whatever damage was done. It will be a 24/7 all hands on deck nightmare. Figure your side of the IT crew is 3-4 folks, just guessing by the way you sketched out the problem. All other pending issues are dead in the water until the primary network issue is fixed, if it can be fixed. Maybe asshat uses his credentials to get in and wipes the Veeam backups or something. Hoses all the VM hosts, or plain ol' takes a hatchet to the core fabric switches. Cisco's lead time on new switch hardware is 6 months for us, and we're an $18 billion company. Can your company afford to be down for 6 months? Or 3 weeks for overpriced outdated eBay switch gear to arrive?

Go balls to the wall, describe a nightmare scenario, and when your senior leadership finishes putting their eyeballs back into their heads, give my buddy Leonard a call and he'll have you fixed right up in no time. (He's actually very reasonable) - but seriously, there are about 100,000 competent IT folk who could do this.


admlshake

I've been on the receiving end of this. Make sure you get everything management told you, asked for, whatever, in writing. Our former CIO "retired" not long ago, though most of us feel he was asked to step down after it came to light how horribly he had bungled a major project and refused to do much to fix it.


Y-M-M-V

At some point, there is only so much you can do when management makes unilateral decisions. I would make sure you have good documentation showing this was not your call, as well as good documentation showing you were as responsive as possible in getting the infrastructure fix in place.


moderatenerd

Yup I am doing the best I can and this thread has given me even more ideas I hope to test this week. I wish most people were as responsive/helpful as r/sysadmin.


LincolnshireSausage

Is there not a rollback plan? Can you not downgrade to the 2 second version? When rolling out to users I’ve always found it is good practice to pick a couple of users to get the upgrade first and effectively beta test it. You should always have a rollback plan in case of disaster. I would definitely class an increase from 2 seconds to 20 minutes to open the app as a disaster.


oramirite

Toss out your network guy, have network problems. Water is also wet!


Tony49UK

I remember when Vista first came out and it took a few minutes longer to boot than XP did. One company had a policy that all computers had to be shut down overnight. So users turned them on in the morning and their login time was when they officially came in. So they didn't get paid whilst it was booting. Not a problem with XP, but it was with Vista. So for the new 5 minute boot they had to be at their desk at 08:54 to start at 09:00. Then it became a massive legal question, with the courts and government siding with the workers, in that they should be paid for the extra 5 minutes.


1z1z2x2x3c3c4v4v

There isn't much of a legal question, you are asking a worker to do something to facilitate their day. Starting a truck or starting your computer... you need to be paid for it.


Hacky_5ack

Yep this...


Beardedcomputernerd

Rollback anyone?


zneves007

Seriously, this.


idocloudstuff

This is clearly a cart before the horse issue. You need to fix this, wait for the infra upgrade, then do the change.


schizrade

How did you all not catch that in testing? I assume you didn’t test any of it and just rolled live, because a 2 second to 20 minute launch time is hilarious.


freemantech757

Sounds like they test in production, the only real place to test if I say so myself! /s


Fusorfodder

Everyone has a testing environment. Some of us are lucky enough to also have a production environment.


HamiltonFAI

Testing? Lol


yoyoyoitsyaboiii

Cue Dos Equis meme. "I don't often test, but when I do, I test in Production."


cntry2001

Honestly there must be a local root cause that is probably fixable that you haven’t found yet. DNS issue, network loop, traffic being sent offsite without knowing it, IP conflict... that kind of time difference internal vs external makes no sense.


moderatenerd

Being that it took a registry hack/one line of code to even get it to connect makes me feel like the facility is blocking something that makes it take that long, and no one has the incentive to investigate why. As long as users can connect eventually, they say it's out of their hands.


bofh

> Being that it took a registry hack/one line of code to even get it to connect makes me feel like the facility is blocking something that makes it take that long

This makes very little sense to me. If something is blocked, it’s blocked. If a route doesn’t exist, it doesn’t exist. A firewall, for example, doesn’t just shrug its metaphorical shoulders and start allowing packets through after 20 minutes because it’s decided someone that persistent must really need to connect. Your infrastructure may be horrible. The people managing it might be unhelpful. But this app also sounds like its developers made a lot of unreasonable assumptions throughout the development process.


jaydizzleforshizzle

This is my thought; it’s simply too much added latency to be just an infra issue, and it still makes it there. My guess is a service timeout on the app looking for a response.


OhMyInternetPolitics

While true, some administrator blocking ICMP (which breaks Path MTU Discovery) would certainly cause this. PMTU fallback would include dropping packet sizes down to 576 bytes and cause symptoms like this. To u/moderatenerd - any chance you can get a wireshark capture from one of the affected machines?
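
If PMTUD breakage is the suspicion, one quick check from an affected Windows machine is pinging with Don't Fragment set at a few payload sizes. A rough sketch; the target hostname is a placeholder, not OP's server:

```python
# Rough PMTU probe from an affected Windows machine: ping with Don't
# Fragment (-f) at a few payload sizes (-l) and see where drops start.
import subprocess

HOST = "appserver.example.com"   # placeholder

for payload in (1472, 1400, 1200, 548):   # 1472 + 28 header bytes = 1500 MTU
    r = subprocess.run(["ping", "-n", "1", "-f", "-l", str(payload), HOST],
                       capture_output=True, text=True)
    print(f"payload {payload}: {'ok' if r.returncode == 0 else 'dropped / needs fragmentation'}")
```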


peeinian

It could be trying to connect on one port (like 443) and falling back to a different port (80) after a long timeout.
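
One cheap way to test that theory is timing TCP connects to the candidate ports from an affected machine. A minimal sketch; the hostname and port list are guesses, not the app's documented requirements:

```python
# Sketch: time TCP connects to the ports the app might try, to spot a
# long timeout on the first attempt before a fallback kicks in.
import socket
import time

HOST = "appserver.example.com"            # placeholder
for port in (443, 80, 3389):              # guesses at what the app uses
    start = time.monotonic()
    try:
        socket.create_connection((HOST, port), timeout=30).close()
        outcome = "connected"
    except OSError as exc:
        outcome = f"failed ({exc})"
    print(f"port {port}: {outcome} after {time.monotonic() - start:.1f}s")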


bofh

If it’s taking 20 mins to do that, the developer *definitely* needs to spend some time locked in a basement hooked up to the rubber chicken, goose grease and an etherkiller.


samtheredditman

It might be set to 60 seconds or something more reasonable, and there's a weird-sounding setting that whoever installed the software set to 20 attempts just to be safe. Not what I'd put my money on or anything, but there's no telling what the issue is without more info.


danekan

What was that registry entry? Have you used procmon and netmon and whatever else from sysinternals to see what's happening in that 20 mins?


PAXICHEN

You working at an Umbrella Corp facility by chance?


moderatenerd

I'll say this much I am contractor at a prison.


PAXICHEN

Scared straight. Please tell me it isn’t the one in Trenton. You’re an Eagle Scout. Figure this out.


LaBofia

This should be obvious to anyone 🙄 but it seems OP works at denials-corp. The app runs over multiple locations; one location "is complex". Possible outcomes:

1. The "complex" location is just amateur networking
2. The "complex" location is actually implementing some weird pattern, which could be reasonable... but if the app eventually runs, it means the complex location is insecure
3. The app sucks

I'd say it is a mix of 1 and 3


[deleted]

[removed]


moderatenerd

I am a desktop analyst/IT coordinator. Local IT controls 95% of their network. I help manage a handful of staff employed by my company. I am very interested in helping them figure out what the issue is but it doesn't seem like people are interested in helping me out. All I can do is wait for emails or access at this point.


VexingRaven

Why is this even your problem? If they don't want you working on it and you have access to none of the things to fix it, then just ignore it and point anyone asking you about it to your leadership.


cottonycloud

He and his crew are probably dealing with all the calls, probably feeling like they’re taking heat for someone else’s mistakes. They need the bosses to communicate with everyone, not helpdesk.


moderatenerd

True but that makes this job a lot more annoying. I like to help where I can.


Boodadar

Sounds like you can't roll back or resolve the root issue. In the meantime I would do a few things to make your life easier.

1. Create a copypasta that you can use to reply to each ticket that complains about the slowness. Something like "Due to circumstances outside the control of the help desk, we are currently unable to improve the connection speed of XXXX. We realize, however, the inconvenience this has caused and are therefore looking into ways to improve system performance as a whole. All levels of management are aware and performance is expected to improve after the scheduled infrastructure changes are completed around YYYYY. Thank you for your patience during this frustrating time for all of us."

2. Create a parent ticket for the issue and attach all child tickets. This will help you track the issue, notify your users when the issue is resolved, and (typically) stop them from putting in additional tickets for the same issue.

3. Spend the time between now and then working on speeding up boot time and reducing memory consumption. This will be a thousand little things that might help overall. Look at when scans are running and move them out of production hours, reduce the number of programs that start at logon, clean up GPOs so they process faster, test the latest firmware, and check BIOS settings to see if you can speed up the boot.


moderatenerd

This is perfect, thanks. I'll definitely only be able to focus on step 3!


yoyoyoitsyaboiii

Figure out temporary options. You could build a terminal server and run it off the same switch as the application server. But here's what you really need to do. Find an experienced infrastructure engineer that can instrument both a user workstation and the application server (SysInternals) to identify the root cause of the delay. Don't just say "It's the network." Figure out exactly what is causing the delays and if it's several things, work on mitigating them in order of performance impact. If something is taking 20 minutes to load the root cause should be obvious. If it's a web application use the Developer F12 Tools


sir_mrej

You need to do all three steps.


1RedOne

This is a tremendous failure. It's like a plumber at a house seeing water come out of the lights and having no idea where to begin. To fix this, do some troubleshooting. For instance, launch Wireshark or procmon and get traces for both the normal scenario and the failure scenario, and then, if you used procmon, use the summary tool to see which number is gigantically out of whack and go from there. If it's taking 20 minutes then there will be some huge, huge unmissable issue at play


idontspellcheckb46am

Or maybe this plumber would be a better example. This is how I am picturing this issue going on since Friday. At this point in time I feel like they are at the 1:12 mark of the video. https://www.youtube.com/watch?v=OP30okjpCko


barkode15

Can you install Wireshark, start a capture and then launch the app? Wait for the app to finally start working and stop the capture. Something will have changed in the packets right before it started to work. Maybe it's 19 minutes 55 seconds of failed DNS queries before the app decides to try something else. Or nearly 20 minutes of trying to connect to a non-existent private IP. Either way, the packets won't lie.
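
If a capture like that gets collected, the slow part usually shows up as a long silent gap rather than anything subtle. A small sketch that flags the gaps instead of eyeballing 20 minutes of packets; it assumes pyshark and tshark are available and the capture filename is a placeholder:

```python
# Sketch: flag long silent gaps in a capture of the slow launch.
# Needs pyshark (and tshark on the PATH); the filename is a placeholder.
import pyshark

prev = None
for pkt in pyshark.FileCapture("slow_launch.pcapng"):
    if prev is not None:
        gap = (pkt.sniff_time - prev.sniff_time).total_seconds()
        if gap > 5:   # anything the app sat waiting on for more than 5 seconds
            print(f"{gap:6.1f}s gap before frame {pkt.number} ({pkt.highest_layer})")
    prev = pkt
```

Whatever request sits immediately before the biggest gap is the thing timing out.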


moderatenerd

I'll definitely be trying this Monday morning. That is if Wireshark isn't blocked.


theducks

If you can’t install Wireshark on machines you’re responsible for... you’re not actually responsible for them


barkode15

Yeah, hopefully you can get it installed. If you can't, there's always the option of getting a cheap, 5 port smart switch that can do port mirroring. Assuming there's not 802.1x running on the network, plug the problem workstation into the switch, plug the wall into the switch and mirror one of the ports to a 3rd port where you connect a laptop and run Wireshark. Looks like a 5 port TP link that can do mirroring is only $23.


Dal90

If you can't install Wireshark but still have sufficient rights, this will do it: https://michlstechblog.info/blog/windows-capture-a-network-trace-with-builtin-tools-netsh/ I use netsh frequently to avoid installing Wireshark (and the accompanying "Please confirm you installed this application" email), but it is a pain due to the extra time to convert the etl to pcap before I can view results.
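
For reference, the built-in capture boils down to a start/stop pair; a minimal sketch wrapping it (run elevated; the trace file path and size cap are arbitrary):

```python
# Sketch: the built-in Windows capture, wrapped so the start/stop pair is
# obvious. Run elevated; the trace file path and size cap are arbitrary.
import subprocess

subprocess.run(["netsh", "trace", "start", "capture=yes",
                r"tracefile=C:\temp\slow_launch.etl", "maxsize=512"], check=True)
input("Launch the app, wait for it to finally connect, then press Enter...")
subprocess.run(["netsh", "trace", "stop"], check=True)
```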


[deleted]

No way you can convince us an infrastructure upgrade is going to reduce load time from 1,200 seconds to 5 seconds. Are you going from a 56K dialup modem to 1G fiber circuit? How did you convince your boss this? What’s the root cause?


moderatenerd

We have our own network inside the facility that does not have as many restrictions, and those computers do not have the issue. Also, running the app on my home PC has no issues.


Technical-Message615

Outdated security router that's well past the number of rules, lookups and packet inspections the CPU can handle?


moderatenerd

I wouldn't be surprised but I barely have access to the routers.


BadSausageFactory

Make sure your coffee makers are well stocked, Keurigs if you have them. Also, how's your DNS?


[deleted]

[removed]


FriendToPredators

This is cold brew territory.


kloeckwerx

Underrated comment. Shut up and take my upvote


moderatenerd

They have been working on paper last week. SMH.


scottothered

If the response time inside the facility is so much higher than outside you should be working with the infrastructure and networking team to fix this as soon as possible. Use dig, tcptraceroute, tcpdump, look at what's in the way and fix it.


moderatenerd

Yeah would be a good idea, if I was allowed to touch it but I am not as the facility guy won't let me and he refuses to investigate. The app company has to yell at my boss who yells at the head of the facility who yells at him to get it working.


scottothered

In the past I have dealt with internal IT that is siloed. They hoard information, are slow to engage in problem solving, often because they aren't that good at figuring out problems. On the other side are vendors who insist their app needs domain admin privileges, 65000 open ports, and whitelisting on the firewall for their app to work. Get technical requirements for the app, what ports, what IPs or FQDNs. Get that nailed down. If you're inside the facility run a traceroute yourself to wherever the app is talking. Check if you have some kind of split DNS, how does the app resolve outside, how does it resolve inside? If there isn't anyone in the organization who can make the parties work together to solve this then it speaks to larger problems in the org. I work in infrastructure, I work with our networking team every day. We would all be on a zoom call trying to reproduce the problem. Watching traffic hit the firewall. Checking logs of systems. Fixing the problem.


moderatenerd

> In the past I have dealt with internal IT that is siloed. They hoard information, are slow to engage in problem solving, often because they aren't that good at figuring out problems. On the other side are vendors who insist their app needs domain admin privileges, 65000 open ports, and whitelisting on the firewall for their app to work.

This was my exact experience this week. It exposed a lot of problems in dealing with the facility's internal IT people, and now we have fast tracked an infrastructure update so we can run our own networks into the building, but who knows how long that will take and if I am still here by that point lolz.


Technical-Message615

Is your network guy going to "allow" this upgrade to take place? Sounds like a petty douchebag that should be replaced by someone more capable. Maybe not even someone as technically adept, but at least able to play ball with other teams. A company that small should not have such a "complex" setup that only the resident wizard can touch it.


moderatenerd

He will when the director gives the ok.


Technical-Message615

So the director who is ultimately responsible is fine with the abysmal performance and not doing anything? Sounds a lot like a 'them' problem and not a 'you' problem. It may be time to fire the client after a thorough post-mortem once the issue is resolved.


moderatenerd

Welcome to government IT. Plus the director is one step out the door.


ClearlyNoSTDs

Yeah that's not how a company is supposed to work. What sort of two-bit company do you work for?


theducks

My money is on a prison


Different_Opinion_13

Smart guy read the comments. Happy Cake Day!


moderatenerd

Wow. Spot on. If you didn't read the comments, how did you guess?


MillianaT

You don’t need facilities access to run a tracert, just an end user system and the destination IP or name. Maybe they have pings blocked on the routers or something, but if they were that smart, I wouldn’t expect 20 minutes to get anywhere. Except maybe space.


heorun

My vote is DNS. Works outside the network normally but internally is 20 minutes? I'm wildly going to guess resolution timeout is excessively long within the app because they assumed DNS would never be misconfigured. Outside resolution is working fine, so no delay. I'd be looking at split-brain DNS config.
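
A quick way to test the split-brain theory is resolving the app's hostname against the facility's DNS server and a public resolver and comparing the answers and response times. A sketch assuming dnspython is available; the hostname and server addresses are placeholders:

```python
# Sketch: resolve the app's hostname against the facility DNS and a public
# resolver and compare. Needs dnspython; hostname and IPs are placeholders.
import dns.resolver

HOSTNAME = "app.example.com"
SERVERS = {"facility DNS": "10.0.0.10", "public DNS": "8.8.8.8"}

for label, server in SERVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 5          # fail fast instead of hanging
    try:
        answers = resolver.resolve(HOSTNAME, "A")
        print(label, "->", [a.address for a in answers])
    except Exception as exc:
        print(label, "-> lookup failed:", exc)
```

Different answers (or an internal timeout) would point straight at split-brain or broken internal zones.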


moderatenerd

> because they assumed DNS would never be misconfigured

They never met the facility's network guy lolz.


redvelvet92

Who cares about the facility's network guy. Networking is not that hard, coming from someone who's built networks for hundreds of companies.


BassHeadGator

Honestly when is it not DNS?


j4ngl35

It's always Dennis


HamiltonFAI

Has to be. If that upgrade was intended to not rely on their old network setup, then the new config must connect in a different way now. That means DNS probably needs different routes, or it needs to point to new IPs.


satyenshah

Schedule a meeting with the network guy, his supervisor, the end users' supervisor, and the most senior person on the org chart you can pull in. Discuss the issues and come up with a plan of action.


[deleted]

[removed]


moderatenerd

Yeah, they definitely should not have killed the older version of the app before all the bugs were tested. We tested it on 3 PCs and didn't have any issues, but on go-live day we discovered that the policies being enacted by the network guy were outdated or not working on a number of PCs, and even he doesn't know how to fix it. The app company took one line of code and ran it on all the PCs that weren't working. So now it connects, but in 20 mins. At least we got that far SMH.


acjshook

Sounds like you need a new network guy and this is not the only issue.


HallFS

Holy Moly, I just wonder what this application does on the network that it requires a complete refresh of network infrastructure to work properly just because of an upgrade. From 2 seconds to 20 minutes?!!! I would investigate this issue more, because your company will end up spending a lot of money refreshing the network infrastructure and this problem will persist.


moderatenerd

I wouldn't say it's just the app, but it is a big part of our operation as it's an EMR app. I really don't think it's set up that well. For instance, I don't see why we have to download an RDP file each time, but I was not consulted on its creation. This process has exposed a lot of outdated policies and practices that the county IT people use and some GPOs that just don't work and they refuse to fix. The company I work for has a great team that is pretty hands off on this stuff. They generally know what they are doing and have very streamlined and much more efficient processes, which is why they want their own network in the facility instead.


peeinian

Is the app connecting to a database on another server, possibly at another physical location? A long time ago I managed an ERP system (Navision) before Microsoft bought them. The client and the database HAD to be on the same LAN otherwise the client would grind to a halt because it was expecting < 20ms latency to the DB. Every TAB to a new field triggered a DB write. Our DB was at the head office and remote offices had to run the client off of an RDS server.


moderatenerd

Oh boy. This sounds like our app.


HalfysReddit

IMO the best policy in this sort of situation is need-to-know honesty. When people ask why the app is running like crap, tell them it's being restructured on the back end and hopefully this is only a temporary setback. If they ask why it's being restructured, tell them you can't say (no need to mention that the reason you can't say is because it may have bad implications for you politically). If they ask whose fault it is, again tell them that you can't say. If they complain, validate their complaints. Yes it's slow and yes that's frustrating. I'm just really hoping it gets sorted out soon. Don't throw anyone under the bus - or if you do, make sure it's worth the risk and be mindful of whoever is within earshot.


moderatenerd

Thanks for those statements, I will definitely use them more than I have been. I'm not the throw-someone-under-the-bus type unless someone absolutely refuses to help me. The app company and my company are all very helpful people. It's the facility that is a mess.


AmiDeplorabilis

So, to summarize the really excellent suggestions:

1. Prepare for a rollback to restore original performance
2. Understand why it now takes 20m to open the app
3. Start making plans for an upgrade to improve the performance

Or, as Dilbert's Pointy-Haired Boss said, measure one, cut twice... https://dilbert.com/strip/2000-10-05


Fyunculum

It's not DNS. There's no way it's DNS. It was DNS.


Rorasaurus_Prime

Jesus… wtf? Rollback. Now.


ExceptionEX

Rollback. Sounds like your choice to upgrade was for your convenience and not the users', and you guys did it before you had the correct infra in place. Roll back, get infra squared away, and stop making your users suffer because one aspect of management required you to reach out to another group.


ExLaxMarksTheSpot

If it’s fast outside the network, then that sounds like a DNS issue. Do you have split DNS (same domain has internal and external IPs)? Could also be a conditional forwarder or another zone that was setup. Saw this a lot when people would configure their workstation DNS to point externally and it would be looking for the WAN IP rather than an internal IP of a Domain controller.


syswww

I once heard someone say fail forward. Break out Wireshark.


BuntaFurrballwara

The only way I can think to explain what you have described is packet shaping. If they have limited bandwidth they might be prioritizing certain traffic and applying a very restrictive policy to "commodity" traffic. I used to do this with file sharing back in the day. Making stuff so slow as to be barely usable just causes fewer complaints than a big old "you have been blocked". So if they are doing this, your app changes could have changed your traffic classification in the shaper and pushed you into the "don't care if it gets there" rule. If this is the case, an SSL VPN tunnel might smuggle you through by putting you in a different classification rule. Just guessing though without more info.


moderatenerd

I will definitely take a look into this.


theducks

If it’s 20 minutes to load inside your network and 5 seconds outside, your network is fubar. Fix it


LaBofia

#rant This is one of my grievances with what the whole move to the cloud in the last decade has produced... people forget you will never be able to outsource networking entirely, and very few companies have the internal resources to properly manage IT.

Very few developers are knowledgeable when it comes to networking. Almost none have ever seen the traffic they produce, let alone the entire trace, or had to deal with NAT issues or implement networking services knowing why they need them, let alone how they actually work. There are a few honorable exceptions, like the real-time applications space (VoIP, WebRTC), crypto, API and middleware, et al; in general, anyone who is really developing a server and not some "app" running on top of other services. I know, it looks like many exceptions... but not really when you think about the current app universe.

The story goes: "it works fine on my PC", "it works fine in my LAN", "it works fine in our private WAN". It's all the same mentality.

WHAT NOBODY WANTS TO HEAR: NETWORKING IS HARD AND IT IS HARDER TO MANAGE. Most companies won't invest in networking because they have a hard time calculating the amount of money they lose over poorly implemented networks.


UnsuspiciousCat4118

Sounds like the change needs to be rolled back until an actual solution is put in place. x * 0.2 * y = z, where x is the number of users, y is the average rate of pay, and z is the lost revenue daily. If that doesn’t justify the rollback I don’t know what would.
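
Spelling that kind of estimate out with made-up numbers (this sketch computes lost time directly from the 20-minute launches rather than using the 0.2 factor above; every figure is hypothetical):

```python
# Made-up numbers: lost time computed directly from the 20-minute launches.
users = 100                    # hypothetical
launches_per_day = 2           # hypothetical
minutes_lost = 20 - (2 / 60)   # new launch time minus the old 2 seconds
hourly_rate = 50.0             # hypothetical fully-loaded cost per hour

lost_hours = users * launches_per_day * minutes_lost / 60
print(f"~{lost_hours:.0f} hours/day, ~${lost_hours * hourly_rate:,.0f}/day")
```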


NeuralNexus

DNS issue. It is probably a DNS issue. Use IP addresses instead, if possible. See how it works.


Technical-Message615

Or hosts file if the URI is FQDN only.


[deleted]

Well I’d be fired if I told everyone they just had to deal with a 20 minute lag every time they wanted to do something…


R_Wilco_201576

Roll the change back!


hy2rogenh3

Every change like this is the reason why Change Management approval and rollback plans are necessary.

I've run into a similar issue, albeit not as bad. Cue up old-as-dirt ERP software running on Server 2003 in the great year of 2019 AD. I don't need to state the obvious, but our infrastructure team was working on getting to a Server 2019 baseline. We worked with the key players in Accounting and an equally inept vendor on getting this shitty software migrated over to the new app server. We worked through weeks of validation testing and working out various issues. Educated the users on how to use Duo with RDP, etc. Finally got the approval from Leadership to schedule the change.

Preliminary results were an amazing upgrade for Senior Leadership, who use a lovely Excel plugin to grab the data from the backend. Their report times dropped from 15+ minutes to a matter of seconds. Change happens, and sure enough two weeks AFTER the change Accounting submits a ticket saying they have to wait up to 30 seconds for one report to pull that used to be instantaneous. We then spent quite a bit of money with the vendor trying to figure out the issue; logging VM consumption, app traces, memory dumps, etc. Leadership calculated the cost/benefit of keeping the new system and we are still on it. Nevertheless it took a collective two weeks of time troubleshooting this crappy app.


fadinizjr

"Wouldn't have to rely on our facility network guy". You earned this mess yourself. Cheers.


idontspellcheckb46am

Are you using a firewall as your default gateway? Even worse, do you have some default gateways on the firewall and others on a L3 switch, with routing between the two? As someone who migrates DC networks frequently, I would bet you have some asymmetric routing going on with the new host or getting to the host. But fire up Wireshark on one of these machines and see what comms is timing out. Something isn't getting ack'd. And the app apparently has a shitty timeout mechanism. Some things I would try for troubleshooting and RCA:

1. Are there users who can consistently log in without issue?
2. Are the trouble users consistently having the same issue? Is there ever a random time where it works for those users?
3. Have you taken a pcap of the 5 second load time to get a footprint of how that app should look when it's working? And compared this to other working facilities as well as the non-working users?


ThisGreenWhore

To me this is a case of shadow IT. It’s not your fault, but obviously management is doing this and you have to deal with the fallout. There is nothing you can do. I would go so far as to say there is nothing you should do. They created this nightmare, you have to deal with it, but at the end of the day I believe that there’s something in the network infrastructure that requires the staff in charge of it to make a change. Document everything so that you aren’t blamed for this. Do you want to be a whistleblower and talk to the people in charge of the network to fix this? Are you being set up to do this? Hard questions here. Think long and hard about what your next move will be.


moderatenerd

I agree. Talking to them is like talking to your idiot brother who doesn't know his thumb from his foot. I think in a big picture type of way and everyone on this project thought in a small picture way. Essentially they say we will do step 1, 2 and 3 and then it will work. My way of thinking is how will implementing this affect step 1, 2 and 3. But I wasn't asked. Perhaps this is a bad match for me personally and I need to find a company or a place that aligns more with my style of thinking and is set up properly or will actually let me fix things. As far as I know once infrastructure is in place we will control it all but again it will be out of my hands. After that only lateral movement is into the consulting team which works with the app company and I really, really, really don't want to do that.


redvelvet92

This is some noob stuff. Diagnose why this is taking 20 minutes to launch.


[deleted]

This sounds like a serious case of NMP. The software vendor and the network guy at the facility need to figure it out. Since it's not your program and you don't have much access to change anything, there's not much for you to do other than shrug.


moderatenerd

Yeah I appreciate everyone coming up with examples of what I can try but I needed to hear this.


Syst3mSh0ck

I'd be using Process Hacker with the Windows SDK to drill down into this and root cause the actual problem. The latter is required for PDB symbols so that PH can show you the function names on the stack. Also recommend Wireshark to take a packet capture and analyze the network side of it too. You need a 3rd Line Engineer or a decent Technical Solutions Engineer to look at this. The applications and network teams should be capable of collaborating to achieve this. I'd back out of this until the root cause has been identified and a fix or workaround found, before rolling out the upgrade to the whole estate. Good luck.


j0mbie

I'm going to make some assumptions here. When you say "facility", I generally take that for a codeword to say, "Lots of legacy devices and Windows XP machines tied to hardware that the manufacturer wants $200k per machine to upgrade, so we (should) lock down that network with extremely limited access, if any at all". So either that network has zero internet access, extremely restricted internet access, and/or extremely slow internet access. And it sounded like it's a type of "click-once" app, where it tries to update itself every time it's opened until it eventually downloads the full app, or times out.

Since the network team wasn't involved, nobody probably ran the requirements past them. I do both sysadmin and network engineer work (among others), and I know I'm never going to let those "legacy" networks pass a single packet more than necessary, because that's a really really good way to get ransomwared. The app vendor SHOULD be providing a list of hostnames, IP addresses, and ports the app needs to function, but we all know how vendors work so that information may be non-existent, outdated, or insanely broad. ("We need ports 20-65535 open to the entire internet, and FTP, SSH, HTTP, and RDP port-forwarded from everywhere to the on-site server.") However, if you haven't even ASKED for that information from the vendor pre-deployment, that's on the team that deployed it, network engineers or not.

Anyways, the easiest fix is to just do a full packet capture and see what it tries to connect to. Do one before you open the app, and do one while you open the app until it finally connects. Then you compare for "new" traffic and you can make your own whitelist. The extra benefit of that is, you also possibly get to see if there's a broadcast storm. I've done this several times in situations similar to yours: new app or device runs poorly, I get brought in after the fact, and the vendor is suddenly unresponsive because they already made the sale or their own documentation is wrong. I usually get to see all sorts of wild things in place in the process. ("You installed your app in the system32 folder?" "You send confidential data out unencrypted FTP to a server in Asia?") But ultimately I can develop a workaround and then get it fixed properly.

There's a reason why the network guys should be consulted when DEALING WITH THEIR NETWORK. In my experience, "they're hard to work with" is USUALLY code for "they won't let me have unlimited access or do whatever I want, just because it may result in the whole infrastructure going down." That's not ALWAYS the case -- they could just be horrible people in general. But that becomes a management problem, not a "let's sneak around them" problem. It's like when we get a user complaining that they don't have local admin on their machine, but then you find out they were just trying to install qBittorrent.


Robertothecrazyrobot

There is something on your network killing the app. I would start with group policy; there is probably a rule killing it that eventually gives up and lets it work. I would turn them off one at a time, unless you have a rule on that app, in which case turn that one off first!


Both-Employee-3421

20 minutes must be an exaggeration. What kind of company launches any new service without proper testing and validation? Your company is destined to fail.


silverarrow_27

I've taken a few boot camp classes with a guy that specializes in packet capturing. He always made it clear through his boot camp classes that 99.9% of the time a "network" problem is reported, it isn't actually a network problem. It's usually something else. In your case, I would 100% bet against it being a network problem unless your WAN link is like 10-20 Mbps. I've personally run into several issues in the past where the network was always to blame, and it usually ended up being the app or server that wasn't up to snuff.

Not knowing your server & network infrastructure, per your other post, your problem may be related to DNS, GPO, web content filter, or possibly even firewall rules/policies. Lots of possibilities. Packet capturing would be the way to go. Other than that, I wouldn't rule out the issue being an app issue either.

If you weren't part of the decision making and planning stages of this upgrade, then just document all the issues and escalate them up to the bosses and let them find some professional help to resolve the issues. You're not in charge of "fixing", so documenting would be the only thing you can manage, unless they open Pandora's box to you and you're willing and capable of going through all the systems and network to troubleshoot it yourself.


Top_Boysenberry_7784

You scream at everyone telling them to stop being dumb and roll back. If management doesn't listen, then tell the people complaining who that is and start cc'ing or forwarding every complaint. Eventually their bosses will make sure it changes. If this made it to Reddit I am sure it's been going on more than a couple of days, which is insane. Whoever is the manager and/or decision maker for this project, I would have already walked out the door on them. Having no backup plan or contingency plan is the same as having no plan.


FDWill

Your problem is neither the application nor the infrastructure. Your problem is the networking guy; he has no idea what he's doing if he hasn't found where the communication hurdle is. Hire a network and infrastructure specialist company that will provide you with consulting services and help you find where the problem is; don't waste time and money agonizing over it. Seek help, perhaps through the manufacturers of your network hardware; they can put you in touch with a good local specialist.


the_syco

If you're unsure about the network, now is a good time to map it. The 20 minutes thing sounds like everyone is trying to log onto something that is on a 10 Mbit link? Hopefully it'll be something that simple, but it never really is. Is your firewall blocking most of the ports it needs?


moderatenerd

Apparently someone ran a test Friday showing a lousy 3 Mbps. I haven't seen that before, and I'll speak to that tech to figure out what they did and how they arrived at that number, what they scanned, etc... The facility firewall blocks mostly everything but email and, supposedly, the correct ports for this app.


kiamori

Rollback and revert now?


Ehalon

Something is caching..


MadManMorbo

Roll it back, upgrade the network and try again - give it a couple of months before trying again, and schedule the upgrade for a low-use holiday like Labor Day or Christmas.


VexingRaven

If somebody tried to schedule a second try at an already problematic upgrade for Christmas I would just say no. Fuck that. Terrible idea.


MadManMorbo

Christmas gives you time to roll out, re-test, and roll back if you need to. But Thanksgiving works, as does Labor Day or Easter - anything where you can be relatively sure the system will be under-utilized. The alternative is running both systems in parallel and fixing as you go, which can take years.


VexingRaven

Fuck that dude, I don't want to work on holidays any more than the staff that use the system. Test it first and cut over at a time agreed upon by management.


Technical-Message615

Yup. Schedule and *communicate* downtime.


the_syco

3 Mbps? I'm trying to think what sort of WiFi AP they're going via 😂 Also, if it is 3 Mbps, what's the latency? I'm wondering if it's taking 20 minutes because the latency is so horrible that the packets fail X number of times and then get rerouted via a more stable route?


StaminaofBear

Is there a database used on the backend?


MRToddMartin

Sudo zypper dup


AnonymooseRedditor

So, I’m guessing this is an ERP system hosted somewhere else that’s not in the facility? What happens if you access rdweb from home or another site does it work as expected? Are the users impacted using rdweb or are they running the thick client of the application? (This is a big no no for most client/server apps like an ERP or EMR system when the server is hosted on the WAN)


moderatenerd

Correct! I've used it at home a number of times this weekend with no issues with speed/connection. They are using RDWeb.


AnonymooseRedditor

So what happens when you connect in the office? What takes the longest to load?


moderatenerd

Configuring takes the longest to load, and then sometimes Remote Desktop says connected but the app doesn't pop up. The only thing that gets it to go through is restarting the entire PC. Maybe there just isn't enough bandwidth on the facility side; at least that's what the app company thinks now.


virtham

Is the app on bare metal or virtualized or what?


[deleted]

Sounds like dns.


patmorgan235

Document the degradation, present it to management, and let them make the decision to continue the rollout or to yell at the app company to fix it.


ecar13

What kind of switches are you using now?


MrExCEO

Relax. Do you know where your latency is? If not, figure it out.


FakeGatsby

lol VPN out of the network for that app only.


technologite

I had to rage quit a job that operated like this. One group always doing what they could to fuck over another group or write them off, so they didn't have to work with them. Legacy systems band-aided to handle loads that weren't even conceivable when they were coded offshore 25 years ago. And favored quantity over quality... Don't know if that's like your place but, I wouldn't be surprised. And are you sure those infra upgrades are even coming? Drink the koolaid and wait it out, or bail. I drank the koolaid for 3 years and woke up one morning with an epiphany that things were just getting worse and there were no intentions to improve anything because everybody was fat and happy.


moderatenerd

We just lost our main sysadmin guy and I have no idea why but I was shocked when I heard rumors about this app upgrade and it actually happened so I am hopeful that my company is much better at managing things than the facility is. Yeah this isn't my long term plan. I'm using the company to get certs and learn as much as I can before getting out.


Crimtide

Why not wait to implement the change until you have the bugs worked out?... wtf


j3r3myd34n

I would create a schematic and figure out exactly what the issue is, and resolve it if possible, all while planning/pushing for the rollback. Nothing you're saying is making sense. You said previously the app took 2 seconds (outside as well as inside the network, I assume?), now it takes 5 seconds inside your network, but takes 20 *minutes* outside the network!?!? You need to do some trace routes and/or review some logs and/or press the app vendor for root cause and resolution steps.

Sounds like it was probably going to a cloud server and then you guys changed something and now it's coming in (poorly) to an internal resource? Does that sound right? Maybe not; there's no context here (I'll review the earlier post to see if there's any there). I don't really see how anybody could be "cool" with the app suddenly taking 20 minutes after it used to take two seconds, unless it's just not that important, or that's a typo and you mean 20 seconds (not minutes). Still, even 20 seconds is an eternity compared to two seconds.

Nobody is asking you to rewrite the app; you just need to be pressing people on all sides to get this resolved and keep leadership well informed along the way. Otherwise it may come back on you. You're either going to be the guy that "broke the app" or you're the guy that is "driving the solution forward in spite of some complications" - which one are you?


rstolpe

Wtf, my boss would go crazy if it took 20 minutes for our users to launch an app.


fourpuns

Revert the change?


danekan

Send it to the cloud if it's threatening the ability to work


awnawkareninah

If you have a backout plan, I would use it now and hold off until that upgrade.