Sunday, September 23, 2018

The Curious Case of TCP/9808

A Mystery in 3 Acts

Flashback (58 days ago)

Prologue

The Scene: A typical day in Ops
Ops engineer: Oh, I just got an alert that the default certificate on one of our enterprise pools is about to expire. Time to create a change request for the upcoming maintenance window and get it replaced.

Act 1 Scene 1

(Two days later, during the weekly maintenance window)
Our trusty ops engineer, having filed the proper change request and obtained approval, creates a CSR, submits it to his internal PKI, and generates a new default certificate. He launches the deployment wizard, imports his certificate, and assigns it to the proper usages (server default, internal and external web services).

No errors are found, and everything appears to be working as expected.

Following his organization’s predefined best practices (trust but verify!), our ops engineer now runs through his certificate replacement checklist.
He grabs his trusty DigiCert utility for Windows and proceeds to verify that the new certificate is being presented on the known ports (5061, 443, 4443). To accomplish this, he launches the tool on the enterprise pool server, selects tools, and clicks “check install” in the certificate installation checker section. He sets the server address to localhost, sets the SSL mode to direct, and checks each port, verifying that the “valid to” date and serial number match the newly provisioned certificate (Exhibit A is shown below).

Exhibit A
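
For readers who would rather script this step of the checklist than click through a GUI, here is a minimal Python sketch. It is not the DigiCert tool and not what the ops engineer ran. It also swaps in a different comparison: instead of reading the “valid to” date and serial number, it compares the SHA-1 thumbprint of whatever certificate each port presents against the thumbprint of the newly provisioned one. All function names are illustrative, and certificate verification is deliberately switched off so that an expired certificate can still be retrieved and identified.

```python
import hashlib
import socket
import ssl


def thumbprint(der_cert: bytes) -> str:
    """Return the SHA-1 thumbprint of a DER-encoded certificate, uppercase hex."""
    return hashlib.sha1(der_cert).hexdigest().upper()


def normalize(thumb: str) -> str:
    """Strip spaces and colons so thumbprints copied from a GUI compare cleanly."""
    return thumb.replace(" ", "").replace(":", "").upper()


def presented_thumbprint(host: str, port: int, timeout: float = 5.0) -> str:
    """Do a direct TLS handshake and return the thumbprint of the served certificate."""
    ctx = ssl.create_default_context()
    # Verification is off on purpose: an expired certificate must still be
    # retrievable so it can be identified, not rejected during the handshake.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return thumbprint(tls.getpeercert(binary_form=True))


def check_ports(host, ports, expected_thumb, fetch=presented_thumbprint):
    """Map each port to True/False (thumbprint match) or an error string."""
    expected = normalize(expected_thumb)
    results = {}
    for port in ports:
        try:
            results[port] = fetch(host, port) == expected
        except OSError as exc:  # covers refused connections, timeouts, TLS errors
            results[port] = f"error: {exc}"
    return results
```

A call such as check_ports("localhost", [5061, 443, 4443], new_thumbprint), with new_thumbprint pasted from the new certificate’s details, returns one result per port. The check is only as thorough as the list of ports it is given.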

Satisfied that this and all other checks have passed, he makes another espresso and falls right to sleep, confident his maintenance has been completed to his intended and expected perfection.

Act 1 Scene 2

(Two days later)
The Scene: A typical day in Ops

Ops engineer: Hey, I just got an alert that the enterprise pool is seeing expired certificate errors.
Upon closer examination, he notices that the error message is saying the server connected to itself on port 9808 and saw an expired certificate.

Ops engineer: Hmm, I don’t recall that port being used for anything, but let me consult the sacred texts.
Finding nothing referring to this port, the ops engineer consults with some of his trusted colleagues. In each case, no one is able to say what this port is being used for, if anything.

Looking back on the facts of the case, the ops engineer realized this expired-certificate alert had fired on the day the old certificate expired, but that cert was not in use as far as anyone could tell. “That ‘scummy’ alerting system must be faulty. It’s reading the cert store, not what is in use,” he thought.

His colleagues agreed, as they too had been subjected to false alarms from that system. “That must be it. We are not seeing any problems that we know of, and no one is calling the emergency line,” they declared. “Let’s remove that cert from the store, even though we usually keep the previous ones around.”

They deleted the certificate from the store and went on with their daily routines, confident they had put this annoyance to bed.

Act 2 Scene 1

(Three days later)
The Scene: The war room (No fighting allowed)

The ops engineer has received an escalated issue. A single customer is having problems signing onto one of their room systems. It seems that only a single account cannot sign in on a single system. Curious indeed. Hmm, all the usual things have been looked at and all seem normal. I think it’s time we alerted the proper authorities.

After consulting with the authorities and providing them with all the required documentation, they waited. And they waited. Executives from all corners were getting anxious… “Do something”, they hollered. “We are”, came the reply. The executives demanded that something be done. “Rebuild the account in question, that will certainly fix the issue.” “We don’t know that for certain, and we really need to understand why this is happening…” was the response. The ops engineer was able to hold off the executives for a few more hours, but finally, they rebuilt the account.

Nothing had changed.

Finally, they received word from the authorities. “A message in the documentation you submitted has yielded a clue!” they cried. “We found a reference to an expired certificate”

The ops engineer and his colleagues were shocked.

They showed the authorities the server.

The certificate in question did not exist.

Anywhere.

The ops engineer explained to the authorities that the certificate they were seeing had been replaced over a week ago. He showed them the DigiCert utility output from his completed CR as evidence that the cert was not in use.

“You must restart the server then, this must be a bug in the code” they explained.

Reluctant to inconvenience thousands of people to resolve an issue with a single system, the ops engineer decided to ponder the situation.

“We will get back to you soon,” he told the authorities.

Act 3 Scene 1

(One sleepless night later)
The Scene: The sleepy, anxious ops engineer is making an espresso

Not until this moment had the ops engineer put the expired cert alarm AND the clue in the documentation together. “What if port 9808 is still presenting that expired certificate?” he wondered.

Firing up his DigiCert checker tool, he pointed it at localhost as before, but this time he looked at port 9808.

HE WAS SHOCKED.

There it was, in cold hard pixels. PORT 9808 WAS STILL PRESENTING THE OLD EXPIRED CERTIFICATE THAT WAS REMOVED FROM THE SERVER. BUT WHAT WAS PORT 9808?

That scummy alerting system was right all along.

He fired up a command prompt. He ran the command netstat -aon | findstr "9808" (Exhibit B). He saw a single listener and 4 “loopback” connections on the server between port 9808 and random high ports.

He then ran get-process -pid xxxx using the PIDs from the netstat output.
Exhibit B
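
The same lookup can be sketched in Python for anyone who wants to repeat it across several servers. The function name and the split into “listener” and “peer” PIDs are illustrative choices, not part of any tool used in the story. One small difference from findstr: a plain text search for 9808 also matches a port such as 49808 or a PID of 9808, while this sketch compares the port column exactly.

```python
def owners_of_port(netstat_text: str, port: int):
    """Return (listener_pids, peer_pids) for a TCP port from `netstat -aon` text."""
    listeners, peers = set(), set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows have five columns: proto, local, foreign, state, PID.
        # Header lines and UDP rows (which have no state) are skipped.
        if len(fields) != 5 or fields[0] != "TCP":
            continue
        _, local, foreign, state, pid = fields
        # rsplit keeps this working for IPv6 addresses such as [::]:9808.
        local_port = local.rsplit(":", 1)[-1]
        foreign_port = foreign.rsplit(":", 1)[-1]
        if local_port == str(port) and state == "LISTENING":
            listeners.add(int(pid))
        elif foreign_port == str(port):
            # The local end is a client socket connected to the port.
            peers.add(int(pid))
    return listeners, peers
```

The text can come from subprocess.run(["netstat", "-aon"], capture_output=True, text=True).stdout on the server in question. The returned PIDs are then looked up by process name, as the ops engineer did with get-process.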

“Hmm, so the listener is rtchost, and the high ports connecting to it are rtcsrv,” he observed.
He crafted an out-of-band change request for that evening. It included restarting the service using the command “restart-service RTCSRV” and checking the port’s certificate before and after the restart.

Act 3 Scene 2

(One day, and one out-of-band change, later)
The Scene: The daily ops briefing

Overnight maintenance engineer: “The CR was successful, port 9808 now shows the correct certificate”

Support engineer: “The end customer has reported that they can now sign into the room system”

The virtual meeting room turned to the ops engineer... “Are these events related? So what is port 9808 doing, anyway? It is not in our sacred texts”, they exclaimed.

The trusty ops engineer virtually looked at his colleagues… “Beats me, the authorities will surely know.”

With the end customer now working as expected, the ops engineer met regularly with the authorities about his findings, looking for the answer to what port 9808 was.

Some systems had this listener and some did not. He found at least 5 systems that had it and many others that did not. The authorities claimed it was a random occurrence. “It is not random,” he told them multiple times, demanding a better answer each time (sometimes more animatedly than others!).
He waited for weeks. He made many espresso drinks.

Time passed.

Act 3 Scene 3

(49 days later)
The Scene: A typical day in Ops

Finally, someone who had access to the sacred texts of server instructions contacted him.

“We have your answer,” the authority figure currently in charge of his inquiry said to him.

“Go to your enterprise server and open the file SharedLineAppearance.exe.config located in C:\Program Files\Skype for Business Server 2015\Server\Core,” he demanded excitedly.

The ops engineer complied.

His jaw dropped. The mystery had been solved.

Exhibit C
The authorities asked him, “Do you use the shared line appearance feature?”

“Only every day,” came the reply.

The authorities explained that port 9808 was used for SIP SUBSCRIBE messages for the SLA feature.

“We have confirmed the fact that when a certificate is replaced, port 9808 does not update its certificate. We have logged an inquiry with our masters. We are closing this inquiry, have a nice day.”

The ops engineer now understood why some systems had the port and some did not. He still did not know if this issue was related to the room system problem, but at least he knew what TCP/9808 was doing.
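
In hindsight, a plain text search of the server’s configuration files for the port number might have pointed at the same file without a 49-day wait. The Python sketch below is one hedged way to do that. It assumes nothing about the format of SharedLineAppearance.exe.config (the contents of Exhibit C are not reproduced here) and simply reports any line of any *.config file under a folder that contains the port as a whole number. The function name and the *.config pattern are illustrative assumptions.

```python
import re
from pathlib import Path


def configs_mentioning_port(root, port, pattern="*.config"):
    """Yield (path, line_number, line) for lines containing the port as a whole number."""
    # Lookarounds stop 9808 from matching inside a longer number such as 49808.
    needle = re.compile(rf"(?<!\d){port}(?!\d)")
    for path in sorted(Path(root).rglob(pattern)):
        try:
            # Assumes UTF-8 (with or without a BOM); a UTF-16 file would not match.
            text = path.read_text(encoding="utf-8-sig", errors="replace")
        except OSError:
            continue  # unreadable file; skip it rather than abort the search
        for number, line in enumerate(text.splitlines(), start=1):
            if needle.search(line):
                yield path, number, line.strip()
```

Pointed at C:\Program Files\Skype for Business Server 2015\Server\Core with port 9808, it would list every config file in that folder tree that mentions the number, which is a starting point for questions rather than an answer.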

Epilogue

“If this port is used on all servers for this feature, why is it not written in any of the sacred texts?” he questioned.

It was then that the ops engineer realized he had another mystery on his hands, one he would likely never know the answer to.


Fade to Black


This reenactment is based on actual events. Characters, businesses, places, events, locales, and incidents are either the products of the author’s imagination or used in a fictitious manner. Any resemblance to actual persons, living or dead, is purely coincidental.

1 comment:

  1. lol I think the term Authorities is a bit generous given who played the part...
