

SharePoint Clustering Techniques for High Availability SharePoint


A keyword that's often thrown around when discussing SharePoint architecture is "clustering". The problem is that it's often not very clear what it means in the context of SharePoint, so here's a quick article on what the term can mean, just so we can all be on the same page when we use it.

Clustering Techniques Available for High Availability SharePoint

So here's how we can "cluster SharePoint" to make it more highly available. There are several options; in no particular order:

SQL Server Clustering for SharePoint

The first thing I assume people mean by clustering is SQL Server clustering for the SharePoint database back-end, of which there are two types – failover clustering and AlwaysOn. Believe it or not, both can be used at the same time.

Failover clustering provides a single logical SQL Server instance over one active server and X passive servers (one active and three passive, for example). The idea is that when the active server dies for any reason whatsoever, another server picks up where the previously active server left off. Automatic failover is key to the whole idea, data is shared between all nodes, and this is the more traditional type of SQL Server clustering – it's quite common to see, in fact.

SQL Server AlwaysOn clustering is a bit more complicated. It's X logical SQL Server instances, for which a single logical interface (the listener) may or may not exist, and the primary node may or may not fail over automatically between the instances. Some people use it like failover clustering – a single endpoint with automatic failover; other SharePoint admins use it to replicate data to a separate site (for disaster recovery, for example), and you can even combine uses. It's sort of failover clustering + mirroring rolled into one.

So that's SQL Server clustering. SharePoint can use any or all of these setups since SharePoint Server 2013, but ultimately SharePoint still just connects to a single logical data-source.
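To make "a single logical data-source" concrete: any client of an AlwaysOn listener connects the same way, regardless of which node is currently primary. A minimal sketch in Python, assuming the pyodbc package and a hypothetical listener name (SharePoint manages its own connections internally; this is purely illustrative):

    import pyodbc

    # Connect to the AlwaysOn listener (one logical endpoint), never to an
    # individual SQL Server node; the cluster decides which node answers.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=tcp:spsql-listener,1433;"    # hypothetical listener name
        "DATABASE=master;"
        "Trusted_Connection=yes;"
        "MultiSubnetFailover=Yes;"           # faster reconnects after a failover
    )
    # Shows whichever physical node currently holds the primary role.
    print(conn.execute("SELECT @@SERVERNAME").fetchone()[0])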

SharePoint Web-Front-End Clustering

This is otherwise known as network load balancing (NLB). SharePoint is very commonly used in this configuration; two or more servers are dedicated to just serving web pages to users, with requests arriving through a network load balancer.

"But an NLB isn't a true cluster!" I hear you exclaim. Well actually it is, albeit outside the core functionality of SharePoint itself. A cluster is just a collection of servers to service a single end-point, and an NLB is a perfect example of that. You can take a server out of the NLB (to reboot the server for example) and outside traffic will carry on quite happily, albeit now with more load to the remaining active nodes.

SharePoint Application Server Clustering

SharePoint needs application roles to handle service-application requests. As that statement is both obvious and slightly dry, here's a real-life example.

Web-part audience filters use user-profile data to decide whether the audience setting applies to the current user, and therefore whether to show the web-part. This type of query requires the user profile service application, which in turn sends requests to application servers running the "user profile" service. These calls work fine (and therefore so does the page hosting the web-part) as long as just one of the servers responds to the request, so always have at least two for each type of service. If there's only one server in the list of "user profile servers" and it doesn't respond, then that's a fatal error for the page rendering.
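Sketched in Python, the failover behaviour is essentially this (hostnames invented; the real routing lives inside SharePoint's service-application plumbing):

    import urllib.request

    # Hypothetical application servers running the "user profile" service.
    UPA_SERVERS = ["http://app1:32843", "http://app2:32843"]

    def call_user_profile_service(path):
        for server in UPA_SERVERS:
            try:
                with urllib.request.urlopen(server + path, timeout=5) as resp:
                    return resp.read()  # first healthy server wins
            except OSError:
                continue  # node down or unreachable; try the next one
        # Only one server configured and it's dead? Fatal; the page render fails.
        raise RuntimeError("no user profile server responded")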

Again, this type of invisible failover ability is technically "clustering", and is actually very handy for dealing with unexpected outages. More on this architecture here.

Related to application-server clustering are search clustering and AppFabric clustering.

Virtual Machine (VM) Clustering

Some people like to cluster the virtual machines that run SharePoint Server (or some other dependent server). If the host running the VM dies, the machine is failed over to another host, sometimes even in another data-centre. Hyper-V can do this quite nicely, and so can other hypervisors.

SharePoint keeps running and life goes on blissfully unaware of the disaster that just happened.

Important: this VM-replica failover isn't supported with any version of SharePoint, as there's no guarantee of data parity between VM replicas. The only thing we do support for SharePoint is failing over to Azure via Azure Site Recovery.

SharePoint Farm Clustering/Disaster Recovery (DR)

SharePoint DR is pretty simple: it's having two SharePoint farms that share the same content. The primary farm is in read/write mode until we move users to the passive farm, which then becomes the new primary. Both farms share the same content, so should look identical to the user, albeit each with its own configuration & service-application databases running in parallel.

The idea is that an entire SharePoint farm can die or go offline for some reason and SharePoint users will still be able to use SharePoint. More on SharePoint disaster-recovery here.

Again, this is also clustering, because we've doubled up the SharePoint servers to provide the same service – running a whole SharePoint farm. It's unlikely anyone means SharePoint DR when they say "SharePoint clustering", but it's worth knowing about just in case.

Which Clustering/High-Availability Techniques to Use for SharePoint?

Good question. All of them if there's enough budget for it :)

If I had to prioritise though, having a disaster-recovery farm is pretty high on the list in my opinion. After that, SQL Server AlwaysOn gives you two replicas of the same data, and having the SharePoint WFEs & app-servers clustered too will give you a pretty resilient SharePoint farm.

Hyper-V clustering is probably last on my list of priorities because it's not easy to set up, and it only guarantees uptime for virtual-machine instances. On the other hand, my preferred designs assume failures will occur at the virtual-machine level and just make sure SharePoint isn't affected by them. Failures will always happen; it's how we handle them that counts for high availability.

Wrap-Up

So I hope that's cleared up what might be meant by the term "SharePoint clustering". In short, it can mean all sorts of things, so hopefully this helps clarify exactly what's meant, and at what level. All of these techniques are useful for making SharePoint highly available, so all should be considered if you care about SharePoint uptime.

Cheers,

// Sam Betts


SharePoint Disaster Recovery Failover Techniques


So you've got (or are interested in) two SharePoint farms running in parallel, and you want to know how to switch users between the two farms when you need to. There are a few options for doing it; choosing the right one largely depends on how quickly you need to be able to fail everyone over, and to what extent you can invest in an architecture that allows said instant failovers.

I say "disaster recovery" but you can also call this "passive/active" SharePoint farm failover too; either way you have users on one farm and you want them to go on the other farm for whatever reason. Here are your options.

SharePoint Farm Failover Methods

There are a few ways of performing your failover, depending on how much effort you can spare to set it all up beforehand.

Content Database Failovers 1st?

Something you'll want to think about is the order in which to fail over users & databases, given there's no clean way of doing both perfectly at the same time.

Generally speaking, for nice & planned failovers in a maintenance window, the order should be:

  1. Fail over SQL Server: make the up-to-now primary SQL instance the new (readable) secondary, and the previously read-only secondary SQL instance the new primary (a sketch of this step follows the list).
    • Once done, the primary SharePoint farm will be in read-only mode for users. Hopefully they'll be expecting this.
  2. Fail over users via one of the methods below.
    • Users will start using the secondary farm immediately or not, depending on which method you use to move them.
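For step 1, with SQL Server AlwaysOn the switch is a single statement. A minimal sketch, assuming the pyodbc package, a hypothetical availability group called "SP-AG", and that "sqlnode2" is the secondary being promoted (a planned AG failover is run on the target secondary):

    import pyodbc

    # Connect to the secondary that's about to become the new primary.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlnode2;DATABASE=master;Trusted_Connection=yes;",
        autocommit=True,  # ALTER AVAILABILITY GROUP can't run in a transaction
    )
    conn.execute("ALTER AVAILABILITY GROUP [SP-AG] FAILOVER;")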

It's impossible to guarantee nobody will see a read-only SharePoint until everyone + the databases have moved over to the secondary farm, so personally I'd schedule a "maintenance hour" if this is a planned failover, so users can expect a couple of speed-bumps as everything switches.

If it's an emergency then, well, just fail over ASAP of course; we're way past caring about perfect user transitions if it's a firefight.

DNS Updates

This is a pretty standard way of redirecting: you have a DNS "A" record for www.contoso.com pointing at a network load balancer on farm 1. You just update the record and wait for the clients to notice.

[Diagram: the www.contoso.com A-record pointing at farm 1's load balancer]

Nothing really that magical here; we just depend on the time-to-live (TTL) value of the "www.contoso.com" A-record. That's to say, once we update the record, the changeover won't be instant for clients that have the DNS info cached already, which may be an issue.
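You can check the TTL you're up against before committing to this method. A quick sketch, assuming the third-party dnspython package and the example record above:

    import dns.resolver  # pip install dnspython

    answer = dns.resolver.resolve("www.contoso.com", "A")
    for record in answer:
        # Clients may keep the old address for up to this many seconds.
        print(record.address, "TTL:", answer.rrset.ttl)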

[Diagram: clients with cached DNS info still reaching farm 1 after the record change]

Pros:

  • Easy to execute.
  • No reliance on fancy architecture.

Cons:

  • Delayed response for clients.

Network Load Balancer (NLB) Reconfiguration

This is a bit trickier, and assumes you even have a network load balancer for web-front-end traffic. In short, you reconfigure the NLB to send traffic to the other farm's web-front-ends instead, and that's it.

[Diagram: the NLB reconfigured to send traffic to farm 2's web-front-ends]

Again, not an overly exotic solution, but it works with instant effect – no dependence on clients updating anything. Just make sure the configuration doesn't at any point send users to both farms, as that'll cause all kinds of havoc (nothing damaging; just that web-form updates, for example, may start to fail for users).
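That invariant is worth a pre-flight check before applying any reconfiguration. A trivial sketch, with invented hostnames:

    # Which farm each web-front-end belongs to (illustrative names).
    FARM_OF = {"wfe1": "farm1", "wfe2": "farm1", "wfe3": "farm2", "wfe4": "farm2"}

    def assert_single_farm(pool):
        farms = {FARM_OF[node] for node in pool}
        if len(farms) != 1:
            raise ValueError(f"split-farm config: pool spans {sorted(farms)}")

    assert_single_farm(["wfe3", "wfe4"])     # OK: everything goes to farm 2
    # assert_single_farm(["wfe1", "wfe3"])   # raises: traffic split over both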

Pros:

  • Instant effect.
  • No reliance on fancy architecture.

Cons:

  • A misconfiguration will give you split-farm syndrome – all requests need to go to one farm or the other, without fail.

Reverse Proxy Redirect

This is probably the most problem-free option in terms of failover risk, but also the most complicated to set up. It basically means you have to configure both farms to accept the internal & external URLs (and please, make sure you thoroughly test that this is set up right, or bad things can happen).

[Diagram: the reverse proxy republishing the application against the secondary farm]

The nice thing here is that because all the clients are already pointed at the reverse proxy, a single reconfiguration of the published application there, pointing it at a new endpoint, is all it takes.

Pros:

  • Instant effect.
  • Simple reconfiguration.

Cons:

  • Complicated to set up.

Things That Can Go Wrong on Failover

Not much, really. Item edits may fail if you're caught mid-failover without realising: say you opened an edit form from the primary farm and the failover happened while you were editing; the update would fail when sent to the secondary farm.

Probably the most potentially dangerous is the NLB approach if mishandled, as that actually could theoretically send users to both farms. Still, only one farm is ever writeable, so nothing too damaging could really happen.

That's it! Comments welcome.

Cheers,

// Sam Betts


SharePoint Disaster Recovery vs. Active Passive Farms


Just a quick clarification on terminology & methodologies for SharePoint "disaster recovery" (DR).

In case you didn't know already, multiple SharePoint farms can be run sharing the same content data, which is very handy if you need near-100% uptime for your SharePoint sites & apps. If for any reason your primary farm dies, you have another farm waiting for you. There are no new ideas in this post; it's mainly just a quick note on terminology & why these differences exist, as there is still some confusion.

This trick is achieved by sharing certain databases between two farms, with said databases being read/write on the "primary" side. All the users are on whichever farm has read/write access to the databases, with the other side(s) receiving any data changes & running in read-only mode.

If for any reason we need to switch farms, we put the read-only DBs in read/write mode and send everyone to the other farm.

[Diagram: two farms sharing the same databases – read/write on the primary, read-only on the other]

SharePoint fully supports having service-application & content databases in read-only mode, and can switch between the two modes without any intervention.

A Quick History – Secondary Farm for DR Only

Back in the day, SQL Server databases could be kept in sync across farms with either mirroring or log-shipping. In the case of SharePoint DR with log-shipping (the most common DR method), the problem was that switching over the read/write SQL instances was a "one-hit" operation. That's to say, you could easily switch from the primary SQL instance to the secondary if the need arose, but going back again took a lot more work.

In the event of a failure on farm 1, we'd go from this:

[Diagram: farm 1 active with read/write databases, farm 2 passive]

…to this…

[Diagram: farm 2 active after the failover]

The only problem is that after the switch, getting the original SQL instance back online & ready to fail over to again can take a while.

Anyway, because of this "one-way" failover nature, the secondary farm was strictly for use in emergencies only – hence the concept of the farm being "only for disasters", and therefore the name "disaster recovery farm".

Normally this was fine, because it's rare for multiple critical, farm-wide problems to happen on both sides. Not impossible, but rare, so this was still a good option to have available even if it was just for critical failures on the primary farm.

Enter: Quick-Switching Failovers with AlwaysOn

These days, with SQL Server AlwaysOn, switching SQL instances is much easier than before. For one thing, when we switch primary SQL instances, switching back doesn't require any extra work.

What does that mean? Well it means now we can switch between the two SQL instances much quicker than before.

What does that mean? Simple really: enter the concept of "active/passive" farms, rather than the more dramatic "disaster recovery" farm model. With active/passive farms, switching is much quicker – just a simple SQL failover + user redirect each time.

Now we can go from this:

[Diagram: farm 1 active, farm 2 passive]

…to this…

[Diagram: farm 2 active, farm 1 passive]

…and back again, and back again; no restores or DB tricks needed, just a failover each time.

What's the Correct Name & DR Model Then?

Good question; a lot depends on your original plans for the 2nd farm. In general, the old-style "only in disasters" model is a bit rubbish, as it relies on having a disaster to find out whether everything will even work – it's the "nuclear option" for disasters. I've seen cases where everything "worked" on the DR site, but when push came to shove the hardware just wasn't ready for the same load; and of course, this only became apparent when the primary went nuclear and everyone was moved over to the DR site.

Partly to avoid unpleasant surprises, my recommendation is to forget about "nuclear option DR farms" and just get used to having two equally capable SharePoint farms, with both regularly switching between active & passive. Now that we have an easy way of failing over, there doesn't seem much reason not to, and it guarantees that both farms could actually do the job in an emergency.

Remind me Again – What Could Force a Failover?

Patching & updates, mainly; it's a necessary but risky business on any platform.

Patching Windows, SQL Server(s), and SharePoint installations all carry the risk that an update will break services. From the Windows platform to .NET to SharePoint: despite our best efforts, servers are complicated beasts, so very occasionally something may slip through the cracks.

This isn't something specifically tied to Microsoft patches either; I challenge someone to find a vendor that doesn't have the occasional patching fracas. Hint: there aren't any.

The solution to this is simple: test patches first, and have another production system on stand-by in case something critical dies.

In other words, have no single points of failure. Active/passive farms are the ultimate high-availability SharePoint solution because they're the only architecture that achieves this.

Patching isn't the only reason; your own code may cause issues, and it's nice to have a quick failback option. I can think of more than one big SharePoint customer that's benefited greatly from having this capability for handling bad code rollouts.

There are all sorts of reasons why it's worth doubling up on farms, but mostly it's the reasons you can't think of for which you'll need it the most ;)

 

Cheers,

Sam Betts


SharePoint Performance Monitoring with Azure


Aside from epic scaling possibilities, another cool ability Azure gives us is HTTP endpoint monitoring – the ability to check that our SharePoint apps are responding nicely, even from various locations around the world if we want.

This is a fairly quick article, mainly thanks to the brilliant simplicity of how this works.

Add Endpoint Monitoring for Cloud Services URL

So, in my SharePoint/Azure environment I have the farm publicly exposed via the cloud-services URL. I want to monitor how quickly the default page loads, as a rough metric of how stressed the system is (there are all sorts of metrics available, of course).

So I go to the SharePoint WFE and click "configure":

[Screenshot: the cloud service's "configure" option in the Azure portal]

At the bottom I can add endpoint monitoring. This is so simple it's insane; all I have to do is add the URL for Azure to ping, then pick which parts of planet Earth I want to ping it from.

[Screenshot: the endpoint monitoring settings – URL plus test locations]

Save changes and that's about it; Azure will now ping the URL every 5 minutes & log the response time so you can monitor it, along with other metrics.
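Under the hood it's nothing more exotic than a timed HTTP GET. If you wanted to reproduce the measurement yourself, it's essentially this (standard library only; the URL is a made-up example):

    import time
    import urllib.request

    def ping(url):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=30) as resp:
            resp.read()  # download the full response before stopping the clock
        return time.monotonic() - start

    print(f"response time: {ping('https://myfarm.cloudapp.net/'):.2f}s")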

[Screenshot: the response-time graph for the monitored endpoint]

Here you can see the response time from Amsterdam, taking a suspiciously long 10 seconds to send back the page.

Why so slow? Well, the site in question has a quickly hacked-together web-part that lets me slow down the page render by whatever I configure in the web-part properties; 10 seconds in this case:

[Screenshot: the test web-part configured with a 10-second delay]

It's not pretty but it works, and it's handy for the next demonstration…
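Incidentally, you don't need SharePoint to play along at home: a deliberately slow test endpoint is a few lines of Python (standard library only; the 10-second delay mirrors the web-part property above):

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DELAY_SECONDS = 10  # same "sleep" the web-part applies to the page render

    class SlowPage(BaseHTTPRequestHandler):
        def do_GET(self):
            time.sleep(DELAY_SECONDS)  # simulate a struggling page render
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>Sorry for the wait!</body></html>")

    HTTPServer(("", 8080), SlowPage).serve_forever()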

 

Add Rule to Monitoring Response

Even better, once you've got your endpoint monitors created, you can set up monitoring alerts on them. Add a rule for any of the metrics just by clicking on it & "add rule":

[Screenshot: the "add rule" option on a monitoring metric]

This will let you fill out a name/description, and then the conditions for the rule:

[Screenshot: the rule's name, description, and condition settings]

…and that's about it. Now to what happens when alerts activate or resolve…

 

Performance is Bad - Alerts Activated!

If the alert becomes active, it means the condition for the rule you set up is being met, or worse. In my case, pages are taking 10 seconds to load, so the condition is easily met and Azure sends me a nice email telling me the alert is "active":

[Screenshot: the "alert activated" notification email]

Thanks Azure! Now in the portal we can see more information…

[Screenshot: the active alert shown in the Azure portal]

Clearly we have a problem with that site; in our case, the problem just being that we have a web-part that kills performance.

Resolving Alerts

Once the page loads in under 5 seconds again, say because said web-part isn't killing performance anymore:

[Screenshot: response times dropping back under the threshold]

…then the alert should become "resolved" again once the condition is met (15 minutes; 5-second average response max).

[Screenshot: the alert now marked as "resolved"]

Hurrah – we're back to normal operation again.
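The activate/resolve logic boils down to a rolling average compared against a threshold. In miniature (illustrative numbers only, mirroring the 15-minute window & 5-second average of the rule above):

    from collections import deque

    window = deque(maxlen=3)  # pretend 3 samples of 5-minute pings = 15 minutes
    THRESHOLD_SECONDS = 5.0

    def evaluate(response_seconds):
        window.append(response_seconds)
        average = sum(window) / len(window)
        return "active" if average > THRESHOLD_SECONDS else "resolved"

    # Note the lag: the alert only resolves once the *average* drops back down.
    for sample in (10.2, 10.1, 9.8, 0.4, 0.3, 0.2):
        print(f"{sample:5.1f}s -> alert {evaluate(sample)}")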

But My SharePoint Sites Aren't Anonymous!

In case you've not guessed already, Azure can only make anonymous requests to an endpoint. This means that whatever SharePoint site/page you configure for it needs to be enabled for anonymous access in order for this to work.

For some that might sound unappetising at first, but all that's really required is a test site set up exclusively for this performance-pinging purpose: an entirely self-contained site somewhere with no confidential data (or even just junk data), but with enough data to make a page load as life-like as possible for a real user. Making the ping tests load about the same amount of data from the site lists is the goal – generating the same load on the farm as a real user would on a real site – and of course the site must be enabled for anonymous access, so Azure doesn't just get HTTP 401s back on each ping.

At the end of the day, this is about gauging how the whole system responds; any test site will share the same content database, IIS configuration and hardware limitations as your non-public sites. We just want a dummy site that gives the servers the same load as a normal site, whatever & however that may be. Once done, Azure can provide the ping tests & alerts, and we can figure out what to do about any problems that Azure flags.

 

Wrap-Up

That's it! It's pretty simple stuff really & very easy to set up, but quite powerful if you need to react quickly to slowness. As we have our farm in Azure, a simple solution could be to re-dimension various virtual machines until we can figure out the culprit bit of code – albeit for our thread-sleeping web-part that wouldn't have helped, of course.

Incidentally, there's nothing stopping you from monitoring an on-premises SharePoint farm the same way; Azure just needs a VM to add the HTTP endpoints to. Anyway, have a play – it's all cool stuff!

Cheers,

Sam Betts


SharePoint Search Service - Failover & Outage Resiliency


I thought I'd share some tests I've done recently on how much more resilient the new search engine in SharePoint 2013 is to server outages. It's especially nice for consuming farms, because there's a nice abstraction from any server apocalypse going on in the service-publishing farm; the consuming farm just carries on and everything keeps working nicely.

Back in the days of SharePoint 2010, if your single search administration server went down then you could kiss goodbye to the search service application it was administering, and also to any dependent apps/services/web-parts/pages that needed it. No more in 2013; there no longer needs to be a single point of failure in your search topology – if you have the hardware to set up a decent topology, that is.

Anyway, here you see a connection to a published search service from a consuming farm:

[Screenshot: the search service application proxy on the consuming farm]

The publishing string for the published search-app is:

urn:schemas-microsoft-com:sharepoint:service:d184aa7911cb41269598b3780592ff52#authority=urn:uuid:ef685d49d59c4782b594f23f163d11eb&authority=https://sp15-search-crl:32844/Topology/topology.svc

Notice the server name mentioned in there.
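If you ever want to pull that server name out programmatically, the publishing string is easy enough to pick apart. A quick sketch:

    from urllib.parse import urlsplit

    publish_uri = (
        "urn:schemas-microsoft-com:sharepoint:service:d184aa7911cb41269598b3780592ff52"
        "#authority=urn:uuid:ef685d49d59c4782b594f23f163d11eb"
        "&authority=https://sp15-search-crl:32844/Topology/topology.svc"
    )
    # The fragment holds '&'-separated authority entries; one is the topology URL.
    for entry in publish_uri.split("#", 1)[1].split("&"):
        value = entry.split("=", 1)[1]
        if value.startswith("https://"):
            print(urlsplit(value).hostname)  # -> sp15-search-crl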

High Availability SharePoint Search

If we look at that topology on the publishing farm, I've basically triplicated all the components. Admittedly this isn't so normal, for performance reasons, but it's set up that way just to demo the point.

[Screenshot: the publishing farm's search topology, with every component on three servers]

Now let's kill the server in the publishing string.

[Screenshot: shutting down the server named in the publishing string]

We can see the effect almost immediately in the search management page:

[Screenshot: the search administration page flagging the dead server]

Clearly there's a problematic server there; one that the web-front-end in the other farm was going to use.

If we look at the event logs on said WFE we can see a health warning thrown up by a timer job – it's nice to know things might be a bit stormy. The consuming farm isn't aware what the impact will be, of course, so it flags the problem just in case.

[Screenshot: the health warning in the consuming farm's event log]

Never mind though, because crucially, as there was no single point of failure, searches still work no problem for our consuming farm/web-application:

[Screenshot: search results still returned on the consuming farm]

Here we see that the web-front-end (WFE) has adapted just fine to the outage, and search results are coming in anyway.

Looking at the WFE's logs, you'd be forgiven for not realising there was even a problem.

[Screenshot: ULS log entries for the query, now naming a different search server]

Notice the new server name in there (there is one; the screenshot isn't particularly clear). No errors or warnings; as far as that particular query operation is concerned, there is no problem.

And that's it!

Obviously it's not magic; if there's not enough redundancy built into your topology then it'll all come crashing down, but here I could turn off any one of the search servers and nothing much would happen. It's a highly available search solution, finally!

Also, this isn't specific to published service applications either; the messages are just nicer. Anyway, I hope someone found it useful!

 

Cheers,

// Sam Betts


SharePoint Server and Web Application Proxy – continued


So, I've recently been playing around with the Windows Server role "Web Application Proxy" a lot more after some initial testing, and I've learnt a few more things about the two working together that I thought worth sharing. Windows Web Application Proxy (WAP for short) is a great way of protecting SharePoint from being exposed directly to the internet at large – it's not as feature-complete as TMG Server, but it's a nice wrapper that's very simple to set up & secure.

Specifically, I wanted to focus on the two different ways Web Application Proxy can be used to front authentication to SharePoint…

SharePoint Authentication Methods – Claims/ADFS or Windows

There are two modes in which SharePoint can be used in conjunction with Web Application Proxy + ADFS, depending on how you've got SharePoint set up. If you have ADFS already configured in SharePoint, you can use Web Application Proxy in claims-aware mode; for the other mode, just selecting an ADFS relying party that's not claims-aware is all you have to do, at which point WAP will assume the backend is Kerberos.

In either mode, the ADFS server is used for user logins & authentication at the proxy endpoint; the only difference is how Web Application Proxy validates you on the backend, transparently to the user. Either it passes the token directly to the claims/ADFS SharePoint website, or it converts it to a Kerberos ticket for Windows authentication.

Windows Authentication SharePoint Backend

This mode should be used when you want to keep Windows authentication in SharePoint without migrating to claims, and it's the most complicated to set up. It needs the WAP server on the same domain as the SharePoint servers, Kerberos configured & working for SharePoint over HTTP, and each WAP server needs Kerberos configured too.

This means:

  • SPNs for the external address given to each WAP server machine account.
    • In my example I have proxy.contoso.com for my external address so each WAP server needs HTTP/proxy.contoso.com in the list of service-principal-names.
  • Constrained delegation needs to be enabled for the SharePoint application SPN.
    • Each WAP machine account needs to delegate to SHAREPOINT\svc_app_pool in my example, which has the internal SPN of HTTP/sp15.

Note: NTLM doesn't work for WAP back-ends – there has to be a valid & working Kerberos configuration from the WAP server to the front-ends. I say "valid and working" because if you have Kerberos "configured" but WAP doesn't work, it's possible that Kerberos wasn't in fact fully configured and connections were silently falling back to NTLM for user sessions. That fallback isn't possible with Web Application Proxy – Kerberos must be fully working if SharePoint is to work in non-claims mode.

Claims-Aware (ADFS) SharePoint Backend

In this mode, WAP lets ADFS handle the login, then redirects back to the WAP URL with the authentication cookie set, just like a normal ADFS login. This is much simpler than a Windows-authentication back-end because there's no conversion needed from ADFS claims to a Windows token; SharePoint just accepts the ADFS token directly and it all just works.

Another benefit of this mode is that because we're not delegating Windows identities, the WAP server(s) don't need to be on the same domain as SharePoint, so it's arguably more secure. The argument being, of course, that compromising the edge servers won't get an attacker any closer to domain access – which is true, so there's that too.

Use Single Authentication-Method Endpoints for WAP-SharePoint Back-ends

Whichever way you configure the WAP server, claims or not, the endpoint that WAP redirects traffic to must have exclusively the authentication method configured that the WAP server expects.

That's to say, if the proxy address is claims-aware/ADFS, then the SharePoint site & zone that the proxy sends traffic to should only and exclusively be configured for the same ADFS provider. Equally, if the proxy endpoint is non-claims-aware, the endpoint that WAP sends traffic to should only have Kerberos authentication configured.

In fact, if your Kerberos site also has other authentication methods enabled, then the whole thing breaks. For ADFS sites, if you don't have the target zone configured like this:

[Screenshot: the target zone configured with the ADFS provider as its only authentication method]

…after logging into ADFS, users will see this:

[Screenshot: the sign-in error users hit after the ADFS login]

Single authentication methods only for internal WAP URLs, then, to avoid this.

Setup Alternative Access Mappings for Web Application Proxy URLs

Finally, SharePoint needs to know about the URLs being used to access the application, both internally and externally. This is only really necessary if the URL being used to access the Web Application Proxy is different from the URL that WAP is redirecting to – if the internal & external URLs are the same, you can skip this bit.

First, make sure there's an extranet/custom/whatever zone in SharePoint with the address the Web Application Proxy is being accessed on. Then add an internal URL so SharePoint knows how it's going to be called by WAP.

[Screenshot: the alternative access mappings for the proxy URLs]

The URL WAP uses has to be added via "internal URLs", pointing at the public zone URL set up for the web-proxy address. In other words, the "extranet" zone is configured for the proxy address (https://spm-webproxy…/) and an internal URL is added for the address WAP uses to forward requests, pointing at the "extranet" zone. That way SharePoint knows what the "internal" URL is for the outside address, and everything works just fine.

Note: for this internal/external match to work, you can't already have a public zone configured with the internal address – if you try to add an internal URL that already exists as a public URL, you'll get an error.

Anyway, the point is that without this matching perfectly you'll get errors – specifically with Office: when you try to edit a file, Office is likely to throw its toys out of the pram with an "unknown error trying to lock file" message or something similar.

[Screenshot: Office's "unknown error trying to lock file" message]

Set up your alternative access mappings correctly and all should work. Also remember that OneDrive for Business will fail to synchronise without AAMs set up correctly.

 

Wrap-up

That's it! SharePoint and Web Application Proxy work great together once you know how they fit. Hopefully this has helped.

Cheers,

// Sam Betts


SharePoint & SQL Server AlwaysOn Failovers – Video Demonstration


To demonstrate how SharePoint and SQL Server AlwaysOn work together, especially during a failover, I made a quick video showing it all in action. Here it is – enjoy (https://www.youtube.com/watch?v=se_M1vdriMA if the embedded video doesn't load):

SharePoint 2013 with SQL Server AlwaysOn

SQL Server AlwaysOn is a key tool in maintaining a high-availability SharePoint solution.

More information on how to implement it @ https://blogs.msdn.com/b/sambetts/archive/2014/05/16/sharepoint-2013-on-sql-server-alwayson-2014-edition.aspx

 

Cheers,

// Sam Betts