In this article I share challenges I encountered with the Azure Application Gateway and Web Application Firewall, together with their solutions. This article is more of a wiki than a blog post and will be updated whenever new issues come up.
Table of contents
- Application Gateway is in Failed State
- Application Gateway does not respond due to exceeded limits
- Application Gateway does not respond without any changes
Application Gateway is in Failed State
Problem:
The Application Gateway is running properly but is shown in the Failed state.
Cause:
- The last update timed out, after which the state was reported as Failed.
Solution:
- You can restore the current configuration by running the following PowerShell script.
- This only re-applies the currently applied configuration; the state then switches from Failed to Succeeded.
- You can also do this via Azure Resource Explorer (Azure Resource Graph is read-only and cannot re-apply a configuration).
#********* Restore Application Gateway from failed state ********#
#Variables
$Subscription = "Subscription Name of the Application Gateway"
$AppGW_Name = "Name of the Application Gateway"
$AppGW_RG = "Resource Group Name of the Application Gateway"
#Script
Connect-AzAccount
Set-AzContext -Subscription $Subscription
$AppGw = Get-AzApplicationGateway -Name $AppGW_Name -ResourceGroupName $AppGW_RG
Set-AzApplicationGateway -ApplicationGateway $AppGw
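To confirm that the restore actually changed something, the state can be read before and after the update. The following is a minimal sketch, assuming the variables and the signed-in session from the script above; it only re-applies the configuration when the gateway reports Failed.

```powershell
# Sketch: read the provisioning state before and after re-applying the configuration.
# Assumes $AppGW_Name and $AppGW_RG are set and Connect-AzAccount has already run.
$AppGw = Get-AzApplicationGateway -Name $AppGW_Name -ResourceGroupName $AppGW_RG
Write-Output "State before: $($AppGw.ProvisioningState)"

if ($AppGw.ProvisioningState -eq 'Failed') {
    # Re-apply the unchanged configuration; the cmdlet returns the updated gateway object
    $AppGw = Set-AzApplicationGateway -ApplicationGateway $AppGw
}

Write-Output "State after: $($AppGw.ProvisioningState)"
```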
Application Gateway does not respond due to exceeded limits
Problem:
During a configuration update on the Application Gateway or Web Application Firewall, or simply at runtime, the Backend Health is unavailable for all resources, no metrics are shown in the overview blade, and the backends do not respond when you try to reach your published resources.
Cause:
- The memory consumption on the Application Gateway was too high because the Web Application Firewall (WAF) limits were exceeded.
- Due to memory fragmentation, together with logging pressure and the periodic scan process, the system ran out of memory and crashed.
- With the applied configuration, the Nginx master process alone consumes 22% of the memory. Once the other processes Microsoft runs are added, the system regularly sits at 70% memory consumption, and that is before periodic monitoring, logging and other clean-up services run.
- This puts the system in a position where any modification can fail, or where high load can take the service down until it is recovered.
Solution:
- Split up the WAF rules to make sure the limit on the total number of exclusions is respected.
- Optimize the WAF rules to reduce the system load, e.g. combine rules or disable managed rules that only produce false positives.
- You can find the limit in the Microsoft Documentation.
- Be aware that for WAF-enabled SKUs the limit is lower than for the normal SKU. This is only documented in the small end notes.
Explanation:
- The WAF deployment impacts the Application Gateway because it is a part of it.
- Any change to a WAF policy is deployed to the Application Gateway, so a config change on the WAF policy results in a config change on the Application Gateway.
- For managed rules there is no limit, as you can only disable them entirely or disable some of their rules. But their memory footprint is very similar to that of custom rules.
- From the WAF point of view, the biggest issue appears to be the number of exclusions.
- If you apply multiple WAF rules, e.g. 3, with the same exclusion to one Listener, it does not count as one; it counts as 3 against the limits on the Application Gateway.
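To get a feeling for how close you are to the limits, the custom rules and exclusions can be counted per WAF policy. This is only a rough sketch: `$WafPolicy_RG` is a hypothetical variable, and the property names `CustomRules` and `ManagedRules.Exclusions` are assumed from the policy object returned by the Az.Network module. Per-rule exclusions and the number of listeners a policy is attached to are not covered, so treat the output as a lower bound.

```powershell
# Sketch: list custom-rule and global exclusion counts per WAF policy in one resource group.
# $WafPolicy_RG is a hypothetical variable; the property names are assumptions about the policy object.
$WafPolicy_RG = "Resource Group Name of the WAF policies"

Get-AzApplicationGatewayFirewallPolicy -ResourceGroupName $WafPolicy_RG | ForEach-Object {
    [PSCustomObject]@{
        Policy      = $_.Name
        CustomRules = $_.CustomRules.Count              # 0 if the property is empty or null
        Exclusions  = $_.ManagedRules.Exclusions.Count  # global exclusions only
    }
} | Sort-Object Exclusions -Descending | Format-Table -AutoSize
```

Because the same exclusion counts once per rule it is applied with, the totals have to be summed across every policy associated with the gateway and then compared with the documented limit.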
Application Gateway does not respond without any changes
Problem:
The Application Gateway is reported as healthy but does not respond. All backends are unavailable, health probes are not available, metrics are not available, and the limits are respected.
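The backend state can also be checked from PowerShell instead of the portal. The following is a minimal sketch, assuming a signed-in Az session; the variable names follow the restore script in this article. In the situation described here the call may itself time out or return no servers, which is a useful signal in its own right.

```powershell
# Sketch: query backend health and flatten it to one row per backend server.
# Assumes Connect-AzAccount and Set-AzContext have already run.
$AppGW_Name = "Name of the Application Gateway"
$AppGW_RG = "Resource Group Name of the Application Gateway"

$Health = Get-AzApplicationGatewayBackendHealth -Name $AppGW_Name -ResourceGroupName $AppGW_RG

foreach ($Pool in $Health.BackendAddressPools) {
    foreach ($Setting in $Pool.BackendHttpSettingsCollection) {
        foreach ($Server in $Setting.Servers) {
            [PSCustomObject]@{
                Pool    = $Pool.BackendAddressPool.Id.Split('/')[-1]  # last segment of the pool's resource ID
                Address = $Server.Address
                Health  = $Server.Health
            }
        }
    }
}
```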
Cause:
- The memory consumption of the data path process kept increasing.
- This is caused by memory fragmentation in the WAF’s memory allocation scheme.
Solution:
- If you have automated your deployment, delete the Application Gateway and redeploy it to save time. Your service is already completely unavailable, so redeploying does not cause any additional outage.
- Also open a Microsoft Support ticket and ask about a solution.
- They will probably deploy a memory watcher mechanism that periodically checks the memory consumption of the data path and recycles it once it exceeds a certain threshold. This helps stabilize the platform in most cases. However, when other processes also use a significant amount of memory, the memory watcher will be unable to bring up a new data path process due to insufficient memory. The old process then exits without a new process taking its place. When no process is running to serve incoming requests, end users cannot get a response from your service.
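If the deployment is automated with an ARM or Bicep template, the delete-and-redeploy step could look roughly like the sketch below. The template and parameter file paths are hypothetical placeholders, and the sketch assumes the template fully describes the gateway, including listeners, rules and the WAF policy association. Anything configured manually outside the template is lost.

```powershell
# Sketch: delete the unresponsive gateway and redeploy it from an existing template.
# The file paths are hypothetical; replace them with your own deployment artifacts.
$AppGW_Name = "Name of the Application Gateway"
$AppGW_RG = "Resource Group Name of the Application Gateway"
$TemplateFile = "./appgw.json"                  # hypothetical template path
$ParameterFile = "./appgw.parameters.json"      # hypothetical parameter file path

# Remove the broken gateway without a confirmation prompt
Remove-AzApplicationGateway -Name $AppGW_Name -ResourceGroupName $AppGW_RG -Force

# Redeploy it from the template into the same resource group
New-AzResourceGroupDeployment -ResourceGroupName $AppGW_RG `
    -TemplateFile $TemplateFile `
    -TemplateParameterFile $ParameterFile
```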
Explanation:
- This is an issue in the Application Gateway / WAF backend and is not caused by anything on your side.
- The increasing memory consumption puts high pressure on the underlying operating system and causes instability in all processes running on the same machine.