Introduction
I recently deployed a load-balanced production WordPress system on Amazon AWS. The process was mostly straightforward but time-consuming, with the only major issue involving SSL.
First off, it is definitely viable to run a load-balanced WordPress infrastructure using all of the AWS infrastructure goodies, with no expenses beyond the AWS charges. The AWS Application Load Balancer works seamlessly and, as a bonus, can terminate SSL sessions, which eliminates the hassle of obtaining an SSL certificate for each instance (by purchasing one or deploying Let’s Encrypt certificates with auto-renewal cron jobs).
WordPress is supposedly stateless; however, I found that this is not always the case, and the system worked most consistently with the load balancer in sticky-session mode.
AWS Load-balanced WordPress Setup
I won’t go particularly deep into the setup, other than to list the high-level steps. Various resources are available on the internet for each step, but nothing fully comprehensive. Each step brought its own unique pain points:
- Install AWS Linux on EC2 instances
- Install PHP, Apache, memcached.
- Install FTP and a dedicated FTP user in PASV (passive) mode, the only mode that works on EC2, with my FTP client of choice, Transmit for Mac. Filezilla had no problems with the initial configuration, but Transmit refused to connect without configuration changes to vsftpd (see the sketch after this list). Another pain point.
- Configure EC2 security groups to allow specific TCP traffic to ports (80-http, 443-https, 22-ssh, 21-ftp, passive ftp range to match vsftpd, etc.)
- Configure RDS MySQL instance
- Configure RDS security group to allow TCP port 3306 traffic from EC2 security group
- Install WordPress conventionally
- Deploy the WordPress site to the server using a migration plugin
- Configure an EFS volume for the wp-content directory, copy the existing wp-content folder to the EFS volume, then mount the EFS volume in place of wp-content
- Create a Route 53 hosted zone for my domain. Create fairly low TTL settings (5-10 minutes) to allow for modifications
- Modify my existing domain provider’s DNS hosting record to point to Route 53’s servers for name resolution. This can take up to 48 hours, so plan ahead.
- Create a new certificate for my domain using Certificate Manager
- Set up a new application load balancer to serve both HTTP and HTTPS traffic using the certificate and across 2 or more Availability Zones. Create a new target group for the load balancer for port 80 only
- Register the existing EC2 instance in the load balancer target group
- Configure Route 53 to set the domain’s A record as an alias to the load balancer (you can point to the load balancer itself rather than an IP address, which is handy)
- Configure SSL detection (below) on the WordPress installation
- Configure W3 Total Cache plugin to use memcached instead of disk caching, enable minification, and AWS CloudFront
- Server setup is done!
- Create AMI (machine image) from the existing instance so it can be used again
- Configure EC2 Auto Scaling and test creating new EC2 instances into the target group using a launch template that references the AMI
- Configure AWS WAF with available WordPress WAF rules
- Configure CloudWatch
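As an illustration of the vsftpd pain point above, the passive-mode settings involved look roughly like the following sketch (the port range and address are placeholders; the range must match the passive FTP ports opened in the EC2 security group):

# /etc/vsftpd/vsftpd.conf (excerpt, illustrative values)
pasv_enable=YES
# Passive data ports; open this same range in the EC2 security group
pasv_min_port=40000
pasv_max_port=40100
# The instance's public (Elastic) IP; placeholder shown
pasv_address=203.0.113.10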
For the initial round, you can consider the environment as individual standalone servers running off a common RDS MySQL instance and a shared wp-content EFS volume. No per-instance user variables are needed when starting up another load-balanced instance.
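As a sketch, each instance’s wp-config.php simply points at the shared RDS endpoint; the endpoint and credentials below are placeholders:

// wp-config.php (excerpt): every instance shares one RDS MySQL database.
// Hostname and credentials are placeholders.
define('DB_NAME', 'wordpress');
define('DB_USER', 'wp_user');
define('DB_PASSWORD', 'your-password-here');
define('DB_HOST', 'mydb.abcdefgh.us-east-1.rds.amazonaws.com');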
Other useful steps:
- Adjust directory and file ownership, group, and permissions to allow Apache and the FTP user to access files. Apache needs 755 (directory) and 644 (file) permissions. Fairly straightforward command line examples can be found; e.g.
chown apache:apache -R *
find . -type d -exec chmod 755 {} \;
find . -type f -exec chmod 644 {} \;
Shared wp-content volume – pros and cons
I am running a shared EFS volume for the wp-content folder. This primarily solves the issue of uploaded content for all servers, but also conveniently solves concerns around deploying plugins, theme changes or fixes to all servers. Content changes by editors and general server administration, with the exception of changes to the root directory (wp-config.php) and WordPress code, can be done as if working on a single server.
With convenience comes a potential downside for plugins or other code not written to deal with files on a shared volume. Because NFS volumes do not reliably support file locking, code that writes to shared files can run into problems. One example is the popular WordFence security plugin, which offers a feature that writes this data to MySQL instead to get around the issue (though I did not get it working). Other plugins may be less fortunate. I found entries in my Apache log pointing to file-locking problems that increased as I increased the server load. Whether these resulted in any actual server issues, I was not able to determine.
The best approach is probably to isolate the wp-content/uploads folder as the common mount point and keep each server’s plugin folders separate, as sketched below.
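A sketch of that arrangement, assuming the amazon-efs-utils mount helper, a placeholder file system ID, and a placeholder web root, mounts only the uploads folder from EFS so each server keeps its own plugins and themes:

# One-off mount (file system ID and paths are placeholders)
sudo mount -t efs fs-12345678:/ /var/www/html/wp-content/uploads
# Or persist it across reboots via an /etc/fstab entry
fs-12345678:/ /var/www/html/wp-content/uploads efs _netdev,tls 0 0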
SSL Detection – The Big Gotcha
SSL detection was the biggest issue I ran into on the site. The AWS load balancer handles the SSL connection termination and acts as a reverse proxy to send traffic to each target EC2 instance. This traffic is sent as HTTP, not HTTPS. This is why the load balancer instance targets are configured to listen on port 80.
SSL detection is important to WordPress because inconsistency in whether to use SSL can cause mixed content (secure SSL/https and non-secure http) to be served by WordPress. A securely served HTML document could contain non-secure references to CSS or JavaScript files. Because modern browsers forbid loading mixed content, this will break websites.
On my site, the mixed content manifested in strange ways, such as third-party Gutenberg blocks not showing all block options in the editor, or subtle style differences. The site may not fully break, because only some of the necessary files may be omitted, depending on the plugin involved and how it uses SSL detection. Some pages may look fine if they do not use the affected styles or JavaScript references.
Detecting mixed content can be deceptive, as no errors show up for normal users. The developer console in most browsers will show the errors, however.
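A quick, rough spot check is to fetch a page over https and grep for insecure references; this sketch assumes the placeholder domain used elsewhere in this article:

curl -s https://your.domain.here/ | grep -o 'http://[^"]*'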
Underneath the WordPress covers there is a function, is_ssl(), that returns true or false depending on whether the incoming request uses SSL. This function is used in many places, internally in WordPress and in plugins, to help determine whether the response back to the caller should be made in SSL mode.
The inability of is_ssl() to return consistent results is what causes mixed content to be returned to the caller, and the main cause is the load balancer sending requests to the server over port 80, masking the original connection’s SSL status.
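To make the failure mode concrete, here is an illustrative pattern (not taken from any particular plugin) of scheme selection built on is_ssl():

// Illustrative only: many themes and plugins effectively do this.
// Behind the load balancer, without the fix described below, is_ssl()
// returns false, so an https:// page references an http:// asset.
$scheme = is_ssl() ? 'https' : 'http';
echo '<link rel="stylesheet" href="' . $scheme . '://your.domain.here/example.css">';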
I eventually found a solution to the SSL detection issue in two forms: using a plugin, or inserting some code into wp-config.php.
Step 1: Fix the SSL detection.
Plugin Option: The SSL Insecure Content Fixer plugin (https://ssl.webaware.net.au/https-detection/) was able to fix this. The plugin accurately detected the best of several available methods for SSL detection and suggested that I use it, which I did.
Code Option: AWS load balancer-specific code can be added to the wp-config.php file. This code detects the presence of the HTTP_X_FORWARDED_PROTO header, which the AWS load balancer adds to all requests, and if present, sets the 'HTTPS' entry in PHP’s $_SERVER superglobal array. That value is what the is_ssl() function checks to return true or false. The code is just a few lines of PHP:
// Trust the load balancer's X-Forwarded-Proto header so is_ssl() works.
if (isset($_SERVER['HTTP_X_FORWARDED_PROTO']) && strpos($_SERVER['HTTP_X_FORWARDED_PROTO'], 'https') !== false) {
    $_SERVER['HTTPS'] = 'on';
}
Step 2: Verify that your Site URL and Home URL contain the https:// prefix. The WordPress Site URL and Home URL, whether stored in the database or in wp-config.php, should be checked to make sure they have the https:// prefix. This prevents recursive redirects.
If you are trying to access a site that was previously non-SSL behind your newly configured SSL-enabled load balancer, it might not start properly, because the previous non-https URL settings are stored in the database. Luckily, these values can be overridden in wp-config.php, allowing the site to start properly. The following code does the trick:
define('WP_HOME','https://your.domain.here');
define('WP_SITEURL','https://your.domain.here');
IMPORTANT NOTE: Both chunks of code must be placed ABOVE the ‘/* That’s all, stop editing! Happy publishing. */’ comment line in wp-config.php, or they will not set up the environment prior to processing the request.
For interest’s sake, when I dissected the code of the SSL Insecure Content Fixer plugin, I was happy to find that its AWS-specific option uses the same code as the manual wp-config.php fix above. The plugin’s other options help deal with reverse-proxy strategies from other cloud providers or Nginx/Apache combinations.
Sticky Sessions
Though WordPress is designed to be stateless in its base form, plugins or themes may require sticky sessions to be enabled on the load balancer, as was the case for me. I suspect that WP User Manager, the content-restriction plugins I was using, or some related code was caching user role information from the database in a static variable, for later use in checking whether the current user belonged to a specific set of roles.
My site uses roles to serve specific content to different user types (e.g. a regular attendee versus those with access to master classes). A regular content site not relying on roles would probably not run into this problem and could run with sticky sessions off, which should help balance the load across servers. Even so, I found that a short sticky-session duration (1-5 minutes) appeared to be sufficient to fix any session problems.
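For reference, a short stickiness duration can also be set on the target group from the AWS CLI; this sketch uses a placeholder target group ARN and a 5-minute duration:

aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/wp-targets/0123456789abcdef \
  --attributes Key=stickiness.enabled,Value=true \
               Key=stickiness.type,Value=lb_cookie \
               Key=stickiness.lb_cookie.duration_seconds,Value=300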
No Home Page Browser Caching
An optimization required in my case was to modify the cache-control headers for the site home page. Because this is a membership-enabled site, non-logged-in users are prompted to log in when they access any page, including the home page. This is achieved through a 302 redirect fired by the content-control plugin, redirecting to a separate login page with a query-string parameter containing the source page (in most cases the home page) to return to after a successful login.
Because this redirect occurred on the home page, browser caching of the redirect became a problem: when a user logged in and subsequently accessed the website URL (e.g. by clicking on the website logo), the cached redirect would open the login screen again, leading to confusion.
One way to solve this is to add some form of nonce or ignored parameter to the logo link (e.g. http://your.domain.here?value=nnnn), where nnnn is a random value that makes the home page link unique so it does not pull the cached version with the redirect.
This method worked for me, but I decided on an alternative better suited to this website: discouraging home page caching in the browser entirely. The home page is the current-events page for the site, so its content is the most subject to change and is best not cached at all. The additional server load was not going to be an issue. I tried to configure W3 Total Cache for this, but could not find that level of control.
I found a way to do this by adding an action in the theme’s functions.php file that modifies the HTTP response headers, setting Cache-Control: no-store when the requested page is the home page and Cache-Control: max-age=600 for other pages (a sketch follows below).
Note that the header Cache-Control: no-cache was also tested, but it did not force the cache to be bypassed as required. The no-store option worked as intended for this scenario.
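For illustration, here is a minimal sketch of the approach in the theme’s functions.php. It hooks WordPress’s template_redirect action (at which point conditional tags like is_front_page() are available) rather than any particular third-party hook; adapt the hook and the max-age value to your setup:

// Discourage browser caching of the home page so the login redirect
// is never cached; allow modest caching everywhere else.
add_action('template_redirect', function () {
    if (headers_sent()) {
        return; // too late to change headers
    }
    if (is_front_page()) {
        header('Cache-Control: no-store');
    } else {
        header('Cache-Control: max-age=600');
    }
});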
Test, Test, Test
I can’t overemphasize the value of load testing for peace of mind. Load testing will give you an idea of the infrastructure required to support the proposed volume. With cloud providers like AWS, Azure, and Google, there is no excuse for not testing assumptions on an actual environment, including load balancing, scale-out/in, and failover.
AWS makes it quite painless to conduct experiments around scaling out servers, the type and size of database needed, the types of instances needed, and so on. You can easily spin up 8-10 or more servers at minimal cost for testing sessions that may last only a half-hour or an hour at most. I was able to test against the production environment before going live, and I feel comfortable doing so during low-volume hours.
I am also able to build out a full load-balanced environment for testing or staging with minimal fuss: the same AMIs used in production can be used for this test environment, simply configured to hit a different database by modifying the wp-config.php and adjusting the EFS mount point. This provides deployment consistency in the machine images. When done, the instances can be stopped or terminated entirely.
I wrote Locust scripts, running on a separate EC2 instance in the same VPC, to exercise the various pages on the production website, simulating the load caused by users logging in around the same time each day and checking content. While this may not be fully representative of the traffic, because static content requests are not part of the Locust scripts, the scripts still exercise the CPU- and database-intensive portions of the site: the initial login and the content page requests. The extensive use of CloudFront to cache static files also routes those requests away from the servers.
I found that with extensive memory caching using memcached, a smallish database instance was sufficient for my needs: a db.t3.small RDS instance running MySQL was totally adequate, and I could probably go a size smaller. The caches are refreshed every 15 minutes, and very little content needs to go live right away, except for the person testing an update, who can do a manual cache flush on their server to check that the content is correct (a side benefit of having sticky sessions enabled).
Likewise, the traffic volume was handled by multiple t3.small EC2 instances. Testing showed that more, smaller servers were better than fewer, larger ones, and AWS’s pricing makes little difference between X servers of Y capacity and 2X servers of Y/2 capacity.
High-volume testing stressed the servers to capacity, allowing me to anticipate potential hotspots like the shared EFS volume. More file-locking failure warnings were found in the logs at higher volumes, so this is an area for future research. What isn’t helpful is the utter lack of information in these warnings as to what is causing them.