Lync Phone Edition devices stuck connecting to server

I want to write about a recent experience troubleshooting an issue during a Lync 2010 to Skype for Business 2015 migration. It caused a lot of head scratching and frustration, so I wanted to document the steps I took before getting it working. The migration to Skype itself was straightforward and everything seemed fine after the Lync pool was decommissioned; however, a few days later an issue emerged.

This particular customer had been using Lync 2010 for many years and had hundreds of Lync Phone Edition devices deployed, both for users and in common areas. These phones had not changed much (if at all) over the lifetime of the Lync deployment. Users logged in on them were successfully migrated and logged back into Skype during the account move, and calls were made successfully.

As part of the migration the DHCP options were updated and the scope moved to a new server, so when some phones stopped working a few days later and just kept displaying “Connecting to Lync Server”, the first thought was to look at the DHCP server. Now if all the phones had stopped working at the same time this post would be a lot shorter, but they didn’t. The majority of phones continued to work, even surviving reboots and new leases. The affected ones seemed to be phones that had either been logged out or reset (wiped of user data). A fresh phone was not able to log in using either USB-based authentication or the PhoneExt/Pin method.

The first thing I did was check that the right DHCP options were being provided. To do this I first used the DHCPUtil tool to emulate a client request; this was performed on the same VLAN as the phones and reported success. The next step was to run the PowerShell cmdlet Test-CsPhoneBootstrap, again from a laptop on the same VLAN. This cmdlet replicates a phone completing the authentication process, and it too passed. Just in case, DHCP was pointed back to the original server, with its options updated to point at Skype, but still no luck.
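The two checks above can be sketched roughly as follows. This is an illustration rather than the exact commands used on site: it assumes DHCPUtil.exe has been copied to the test machine and that the Skype for Business management shell is installed, and the extension, PIN and SIP address are placeholders.

```powershell
# Emulate a phone's DHCP request from a machine on the phone VLAN.
# Assumes DHCPUtil.exe has been copied from a Front End server into the current folder.
.\DHCPUtil.exe -EmulateClient

# Replicate the PhoneExt/PIN sign-in that a Lync Phone Edition device performs.
# The extension, PIN and SIP address below are hypothetical placeholders.
Test-CsPhoneBootstrap -PhoneOrExt "1234" -PIN "0000" -UserSipAddress "sip:user@domain.com" -Verbose
```

Both passing only shows that DHCP and the bootstrap web services respond correctly; as this post goes on to show, it does not prove the phone will register.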

The customer, like most, used an internal Certificate Authority, in this case with an offline root and a subordinate issuing CA, neither of which the phones trust by default. This is perfectly acceptable and expected, since the Lync Phone Edition sign-in process includes connecting via HTTP to download the certificate information before connecting via TLS. According to the Microsoft documentation, in some scenarios the phone can instead retrieve the certificate from AD or through web enrolment; in any case the root certificate was published correctly within AD. Upon checking the Skype service and IIS logs I could see the requests from the phone over HTTP and the certificate chain being presented to it, but still no joy.

My next thought was that it might be a TLS-related issue, since the phone was constantly asking for the certificate information to be downloaded. Either it didn’t like the certificates, or it couldn’t negotiate a secure connection, since the phones only support TLS 1.0 (which is why they’ll soon not be supported against Office 365 once it enforces TLS 1.2). So I started down that road. After checking the Schannel section of the registry on the Front End servers I could see that TLS 1.0 had not been disabled, but some changes had been made to the cipher suites to increase security. These were reverted to the Windows Server 2012 R2 defaults, but still no joy.
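A minimal sketch of that registry check, to be run on each Front End server. The registry paths are the standard Schannel and cipher suite policy locations; what you find there will depend on how the servers were hardened.

```powershell
$schannel = "HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL"
$tls10    = Join-Path $schannel "Protocols\TLS 1.0\Server"

if (Test-Path $tls10) {
    # Enabled = 0 (or DisabledByDefault = 1) means TLS 1.0 has been switched off for inbound connections
    Get-ItemProperty -Path $tls10 | Select-Object Enabled, DisabledByDefault
} else {
    "No TLS 1.0 server key present, so the operating system default applies."
}

# A cipher suite order pushed by policy shows up here; the key is absent when defaults are in use
$policy = "HKLM:\SOFTWARE\Policies\Microsoft\Cryptography\Configuration\SSL\00010002"
if (Test-Path $policy) {
    (Get-ItemProperty -Path $policy).Functions
} else {
    "No cipher suite order policy is set."
}
```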

At this point I decided to try another type of device. Since Lync 2010 and Skype for Business desktop clients were still able to sign in, I was sure the problem was client related somehow. Since Lync Phone Edition was released, several vendors have brought out certified (3PIP) devices that do not run the Lync Phone Edition OS/app but are capable of registering natively to Lync/Skype. Fortunately I have my own Polycom VVX series phone, which I was able to take with me on my return visit to the customer to troubleshoot further. I ensured my phone was using the Skype base profile and reset it to make sure no other settings were retained. After restarting, it was able to register using the PhoneExt/Pin method. Now we were getting somewhere.

I then struggled to work out the differences between the two device types. The VVX series has accessible logging and management, which allowed me to confirm it was connecting to Skype as expected, retrieving the private certificate and storing it. Unfortunately, getting logs from the Lync Phone Edition device was a struggle, and when I did get them they didn’t seem to shed any light.

Since I wasn’t able to get logs to show me what was happening with the Lync phone during start up and registration I asked the customer to set up a port mirror on a switch so I could use Wireshark to see what traffic was being sent in the hope I could see something and work out what was happening. With Wireshark up and running I was able to see the phone start up, retrieve DHCP options and then use HTTP to retrieve the certificate, all as it should be, before it then looped.

It took me a while looking blankly at the packet capture before something jumped out at me: the phone was trying to establish a TLS session with the Edge servers’ external IP addresses! Internal clients should not be able to reach the external interface of the Edge server, and all communication on the LAN should use the internal one, so why was it trying to connect externally? (And how?)

The why became apparent when it dawned on me that they did not have split-brain DNS; instead they had been creating pinpoint DNS zones for each of the required hostnames on their internal DNS, as well as creating the records externally. The phone was carrying out a DNS request for the external SRV record (_sip._tls.<domain>) and, as no pinpoint zone existed for it internally, was receiving a response from public DNS. Now this in and of itself wouldn’t necessarily be a problem, but it turns out that on their network clients can reach the external IP addresses of servers.
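A quick way to check for the same condition from an internal client might look like the sketch below. The domain is a placeholder, and it assumes a Windows machine with the Resolve-DnsName and Test-NetConnection cmdlets available (Windows 8.1 / Server 2012 R2 or later).

```powershell
$domain = "domain.com"   # placeholder SIP domain

# What does an internal client get back for the external SRV record?
$srv = Resolve-DnsName -Name "_sip._tls.$domain" -Type SRV -ErrorAction SilentlyContinue |
    Where-Object { $_.Type -eq "SRV" }
$srv | Select-Object Name, NameTarget, Port

# Can the internal network actually reach whatever that record points at?
foreach ($record in $srv) {
    Test-NetConnection -ComputerName $record.NameTarget -Port $record.Port
}
```

If the SRV record resolves internally and the TCP test succeeds, internal phones may attempt to register against the Edge server's external interface, as happened here.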

The fastest solution was to create an empty zone for _sip._tls.<domain> on the internal DNS, since I didn’t want to make firewall changes late in the day or try to work out why clients were able to route to the external interface. After the empty zone was created the broken phones were able to sign back in again. Happy days!
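On a Windows DNS server the empty zone could be created along these lines. This is a sketch that assumes AD-integrated DNS and the DnsServer PowerShell module; the domain name is a placeholder, and the replication scope should match whatever the existing pinpoint zones use.

```powershell
# Run on (or against) an internal AD-integrated DNS server; requires the DnsServer module.
# "domain.com" is a placeholder for the SIP domain.
Add-DnsServerPrimaryZone -Name "_sip._tls.domain.com" -ReplicationScope "Domain"

# From a client, confirm that the SRV record is no longer returned internally
Resolve-DnsName -Name "_sip._tls.domain.com" -Type SRV -ErrorAction SilentlyContinue
```

Because the zone is authoritative but contains no SRV record, internal clients stop receiving the public answer, while external clients are unaffected.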

TLDR: The lesson of this story is to check that internal devices can’t reach the external interface of the Edge servers, as per the Microsoft documentation. Since the desktop client was working both internally and externally, it wasn’t something that jumped out to check.

Bonus info: I was curious as to why the Polycom VVX phone worked, so I performed a Wireshark capture on it. I could see that after it performed the HTTP request for the certificate it switched to using the lyncdiscoverinternal/lyncdiscover method to authenticate, like the desktop/mobile Lync 2013/Skype clients do.

 

Call Queues using hybrid or Cloud Connector Edition PSTN numbers

Call Queues provide a simple way to distribute calls between users in Office 365, in a similar fashion to the historic Response Groups in Lync/Skype. When they were first introduced you could only assign them service numbers provided by Microsoft, which was fine unless you were unable to port the number to Office 365.

Late in 2017 it was announced (https://techcommunity.microsoft.com/t5/Skype-for-Business-Blog/Announcing-new-capabilities-in-Auto-Attendant-and-Call-Queue/ba-p/122962#M697) that it would be possible to reach Call Queues (and Auto Attendants) using SIP addresses. This opened up the opportunity to use on-premises numbers via CCE or hybrid.

If you’re using directory synchronisation, after creating your Call Queue you’ll receive a warning telling you that the Active Directory object is missing, along with a PowerShell command to create it. Unfortunately, this PowerShell command is included in a Skype for Business management tools update, so it is of no use if you don’t have a Skype hybrid environment. For example, if you’re using Cloud Connector Edition (CCE) appliances you’re unlikely to have the correct PowerShell module.

The solution, originally highlighted in the post but since removed, is to create the object manually with a UPN that matches the SIP address and then assign the required attributes. This can be done using the following PowerShell for Call Queues:

# Replace the placeholder OU path, display name, number and SIP address with your own values
$OU = "ou=,dc=,dc="
$displayname = "Queue Name"
$lineuri = "tel:+123456789"
$guid = [guid]::NewGuid()
$sipaddress = "sip:hg_xxxxxxxxxxxxxxxxxx@SIPDOMAIN.com"
$name = "{" + $guid.Guid + "}"
# Trusted application URN for Call Queues
$objurn = "urn:trustedonlineplatformapplication:11cd3e2e-fccb-42ad-ad00-878b93575e07"
$deploc = "sipfed.online.lync.com"
$cn = "CN=" + $name
$dn = $cn + "," + $OU
# The UPN must match the SIP address without the sip: prefix
$upn = $sipaddress.Replace("sip:","")
# Create the user object, then stamp the Skype attributes onto it
New-ADObject -Type User -Path $OU -Name $name -DisplayName $displayname
Set-ADObject -Identity $dn -Add @{"msRTCSIP-ApplicationOptions"=256;"msRTCSIP-ArchivingEnabled"=0;"msRTCSIP-DeploymentLocator"=$deploc;"msRTCSIP-OptionFlags"=384;"msRTCSIP-OwnerUrn"=$objurn;"msRTCSIP-PrimaryUserAddress"=$sipaddress;"msRTCSIP-UserEnabled"=$true;"UserPrincipalName"=$upn;"msRTCSIP-Line"=$lineuri}

The process is the same for auto attendants except you require the trusted application to be: “urn:trustedonlineplatformapplication:ce933385-9390-45d1-9512-c8d228074e07”

Routing PSTN calls from CCE to a Call Queue requires rewriting the destination sent from the SBC so that it is the SIP address rather than the E.164 number that is usually expected. Upon making this change I discovered that the CCE mediation server was rejecting the call due to an unknown server error.

The solution suggested in the comments was to add ;ms-skip-rnl to the end of the translated address, and sure enough this worked in my case too: inbound calls to the Call Queue from the CCE succeeded. Further comments suggest that this isn’t always the case, so it is worth testing thoroughly before going into production.

Migration from Lync hybrid to Skype for Business Online using own PSTN carrier

I’ve just completed a migration of a customer to Skype for Business Online. That’s right, Skype not Teams. The reason is that the customer wanted to retain their existing PSTN connectivity, which is currently only possible with Skype. Support for using your own SIP trunk with Teams is expected, with a preview due at some point in 2018, but as yet there is no fixed date.

There are two real choices for PSTN connectivity with Skype for Business Online using your own carrier: maintain a hybrid deployment or use Cloud Connector Edition (CCE). In this case a Sonus Cloud Link appliance containing both a Sonus Session Border Controller (SBC) and CCE was deployed. In the future it will be used to provide SIP straight into Teams.

You might ask, why would they not just port their numbers over to Office 365 and use Microsoft as their carrier? This particular customer wanted all their staff to be able to receive PSTN calls but realised that only a small minority would be likely to need to make calls. Porting all their numbers and using Microsoft as a carrier would require more calling plans than they would actually benefit from. Additionally, by using the CCE they are able to keep the voice traffic off their main internet connection to ensure the best call quality available to them.

Now that the approach and design is taken care of I just want to cover a few points encountered during the switch over.

Firstly, it’s worth noting that using CCE is only possible if there is no on-premises deployment; Microsoft reversed its decision to introduce support for that topology in 2017. When a user is configured to use CCE for telephony, Microsoft checks whether they’re a pure online user or part of a Lync/Skype hybrid deployment.

How does it determine this? And how can you check? Well, you can connect using Skype for Business Online PowerShell and check the InterpretedUserType attribute, for example:

Get-CsOnlineUser -Identity user@domain.com | Select-Object InterpretedUserType

If the returned value is hybridOnPrem or hybridOnline then Office 365 thinks there is a hybrid deployment. But if the users have been successfully migrated, the Lync/Skype deployment has been decommissioned and all relevant hostnames have been changed, how or why does it think there is still a hybrid deployment?
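To find every affected account rather than checking users one at a time, something like the following could be used. It is a sketch that assumes an existing Skype for Business Online PowerShell session; the property is exposed on the user object as InterpretedUserType.

```powershell
# Assumes you are already connected to the tenant with Skype for Business Online PowerShell
Get-CsOnlineUser |
    Where-Object { $_.InterpretedUserType -like "Hybrid*" } |
    Select-Object UserPrincipalName, InterpretedUserType
```

On a large tenant this will be slow, as the filtering happens on the client side after all users are returned.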

The answer lies in the msRTC* Active Directory attributes used by on-premises deployments. If a user is synchronised to Office 365 with these attributes still populated, it is automatically assumed that a hybrid deployment exists, and as InterpretedUserType is a system-generated value it cannot be changed manually.

The solution is simple: remove all the msRTC* attributes, resync the user and wait. After Microsoft’s internal replication had taken place, which in my case seemed to take around 5-6 hours, rerunning the command returned a value of DirsyncedPureOnline and the user could be configured for CCE.
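Clearing the attributes for a single user might look like the sketch below. It assumes the ActiveDirectory module is available, that synchronisation is handled by Azure AD Connect, and that the account name (a hypothetical one here) is replaced with a real one. It only touches attributes whose names start with msRTCSIP-.

```powershell
Import-Module ActiveDirectory

$sam  = "jbloggs"   # hypothetical account name
$user = Get-ADUser -Identity $sam -Properties *

# Collect every populated attribute whose name starts with msRTCSIP-
$attrs = @($user.PropertyNames | Where-Object { $_ -like "msRTCSIP-*" })

if ($attrs.Count -gt 0) {
    # -WhatIf only reports what would be cleared; remove it to make the change
    Set-ADUser -Identity $user -Clear $attrs -WhatIf
}

# Then, on the Azure AD Connect server, trigger a delta sync:
# Start-ADSyncSyncCycle -PolicyType Delta
```

Running it with -WhatIf first gives a chance to review the attribute list before anything is removed.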

This could potentially be one of the reasons why it is suggested to keep the hybrid deployment if using your own carrier. However, as not all customers want to maintain complex on-premises deployments, I just want to reassure you that it can be done.

Sonus Cloud Link – We failed to run publish-ccappliance

During the deployment of a slave CCE appliance when it came to the last step I received an error message: We failed to run Publish-CCAppliance.

I waited around 5 minutes and ran the step again; it completed successfully and the appliance started handling calls. I was unable to see what the Management service was doing or why the automated process wasn’t able to handle it, but I thought I would share this and reassure people that running it again worked.

Forwarding Cisco Call Manager Call to Skype for Business Cloud Connector fails

During a recent Sonus Cloud Link deployment integrating Skype for Business Online with Cisco Call Manager I ran into an issue with certain calls failing. As the customer was piloting the system, we were not yet removing users’ existing Cisco extensions but rather forwarding them to their Skype numbers.

If Cisco users called the Skype number directly it would work but if they called the forwarded extension it would fail, strange.

So the first step was to look at the Sonus logs to see if I could spot anything. On the working call everything looked normal.

The only difference with the failing call was the addition of the Diversion header entry.

Now that I’d found a difference, it was time to see if I could change the forwarded call’s headers to match the working call. Since we were using Skype Cloud Connector Edition we couldn’t make any changes on the Skype side, as they’d be lost following any update.

Luckily we were using a Sonus to route calls between the two, so I created a Message Manipulation Rule under Message Rule Tables.

I then created a header rule with an action of Remove for the header name Diversion. I didn’t add any conditional access expression.

After creating the rule I added it as an Outbound Message Manipulation entry on the Skype CCE signalling group.

After which forwarded calls worked as expected. I’m not sure why the Skype CCE or Skype Cloud PBX did not like the diversion header but the customer was happy with the result.

Sonus Cloud Link – Powershell module is not ready

Having just completed my first highly available Sonus Cloud Link I just thought I would mention an error message I came across when installing the second appliance.

The Sonus Deployer tool showed a very nice red error message saying the PowerShell module is not ready.

This was also shown in the underlying PowerShell window.

I had seen a similar message after installing the Skype Online PowerShell module, when you try to load it without having restarted. However, the ASM server had been rebooted and I was able to load the Skype for Business Online PowerShell module and connect to the tenant.

If you read all of the notices and guides you will see mention that the length of time it takes to deploy the second appliance is dependent on the connectivity and speed of the first, which led me to check the first appliance.

I then discovered that the Primary appliance actually had been configured as standalone! It was then a case of redeploying the CCE configuration, this time ensuring the HA Master option remained selected and then afterwards the second appliance was able to progress.

Lesson of this story, ensure you get the configuration correct on the Master appliance and make sure it is saved!

Error upgrading Azure AD Connect on a Domain Controller

In production and for customers I always recommend installing Azure AD Connect on a dedicated machine. In my lab, however, I’m a little constrained with resources and so have installed it on the Domain Controller, which up to now has been fine.

Today I decided to upgrade to the latest version (1.1.119.0 as of writing) so I duly downloaded the setup and proceeded with the in place upgrade having successfully done so in the past.

This time it would error out and advise reading the logs, in which I found:

Error 25037.The groups entered do not all exist or cannot be found.

On a standard server AD Connect will create local security groups to manage access. However, since I was using a domain controller this wasn’t possible, nor was I prompted to select custom groups.

So as it is a lab I tried uninstalling it and the supporting components before performing a clean install and lo and behold it installed correctly.

I can only presume the setup process is slightly different during the upgrade and it can’t cope with the domain controller’s lack of local security groups.

Please remember, if you are doing this, to make a note (or take a backup) of any changes in OU filtering or rule changes from the default prior to uninstalling.
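One way to capture the current configuration before uninstalling is to export it with the ADSync module that ships with Azure AD Connect. This is a sketch only: it assumes the Get-ADSyncServerConfiguration cmdlet is present in the installed build, and the output folder is a hypothetical path.

```powershell
# Run on the Azure AD Connect server
Import-Module ADSync

# Hypothetical output folder for the exported connector, sync rule and global settings XML files
$backup = "C:\Temp\AADConnectConfig"
Get-ADSyncServerConfiguration -Path $backup
```

The export is useful as a reference for re-creating OU filtering and custom rules by hand after the clean install; a screenshot or written note of the OU selection is still worth taking alongside it.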