Google App Engine: issues with dynamic instances and DeadlineExceededErrors

1. Dynamic instances and processing time

This Google App Engine feature came to me as a surprise, though it makes perfect sense. Your site is slow if it has low traffic.

Google App Engine runs Python code on instances. By default, instances are dynamic: they are shut down if they do not get enough traffic (requests per minute). Thus, if you only get individual hits now and then, App Engine must start a new instance for each hit.

When this happens, you see the following in the App Engine console logs for every request on low-volume traffic:

This request caused a new process to be started for your application,
and thus caused your application code to be loaded for the first time.

It is not always OK to add a processing delay of 500 to 2000 milliseconds on top of the normal processing time. Google’s own recommendation was that each page should be served within 200 milliseconds.

There are three ways to work around this issue:

  • Use the App Engine premium feature “Always on” ($0.30 / day), which keeps your instance always running
  • Use a cron job or similar to keep your instance alive (polling once a minute seems to do the job)
  • Optimize your imports and split your code into several modules with few imports each, so that start-up is fast (modules are imported only once)
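
The second and third options can be sketched together: a tiny keep-alive endpoint for a cron job to poll, plus a handler helper that defers a heavier import until it is actually needed. This is only an illustration, not our production code. The names keepalive_app and render_report, the /keepalive URL and the use of the standard-library decimal module as a stand-in for a "heavy" dependency are all assumptions.

```python
def keepalive_app(environ, start_response):
    """Minimal WSGI app for a cron job to poll (hypothetical /keepalive URL).

    It needs no imports at all, so loading it is cheap.
    """
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


def render_report(amount):
    """Hypothetical helper that needs a heavier dependency only occasionally."""
    # Deferred import: the cost is paid on the first call, not at instance
    # start-up. Python caches loaded modules, so later calls are cheap.
    from decimal import Decimal  # stand-in for a heavy library
    return str(Decimal(amount).quantize(Decimal("0.01")))
```

If I remember the cron.yaml format correctly, an entry with a schedule such as "every 1 minutes" pointing at that URL would do the polling. Check the App Engine cron documentation before relying on it.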

We are using Zabbix to monitor our sites (side note: I don’t recommend Zabbix as a first choice of monitoring software, as it is very difficult to use and has a bad user experience, alienating both sysadmins and developers). This is what we had before the optimizations, with App Engine starting a new process for every request:

… and this is the output we got after the optimizations:

Here is the corresponding diagram after the optimizations from the App Engine dashboard itself. These processing times are without network latency. As far as I know, Google does not expose the endpoints of App Engine hosting, so you don’t know from which part of the world your responses come. By comparing this diagram to the one above, you can see how Internet traffic affects your App Engine application.

2. The PITA of dying instances

For some reason, App Engine instances misbehave sometimes. This causes HTTP requests to die ungracefully.

Normally this is not a problem, as you only lose a few page loads now and then. People are used to “Internet grade” service and can hit the refresh button if they have problems opening a page.

However, if you are monitoring your site and it raises an unnecessary alarm in the middle of the night, waking up your bastard operator from Hell, he will be very angry the next morning and tell you to migrate the crappy software from unreliable Python / App Engine to more reliable PHP servers 🙁

This is what you see in App Engine logs:

A serious problem was encountered with the process that handled this request, causing it to exit.
This is likely to cause a new process to be used for the next request to your application.
If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)

After digging deeper, you see that the problem is in instantiating a new object in the database, which exceeds the 30-second hard limit for processing an HTTP request:

2011-03-09 05:06:20.794 / 500 30094ms 86cpu_ms 40api_cpu_ms
0kb Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 2.0.50727),gzip(gfe),gzip(gfe),gzip(gfe)

<class 'google.appengine.runtime.DeadlineExceededError'>:
Traceback (most recent call last):
  File "/base/data/home/apps/mfabrikkampagne/1.347249742610459821/main.py", line 494, in main
    run_wsgi_app(application)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/util.py", line 97, in run_wsgi_app
    run_bare_wsgi_app(add_wsgi_middleware(application))
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/util.py", line 115, in run_bare_wsgi_app
    result = application(env, _start_response)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 515, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/mfabrikkampagne/1.347249742610459821/main.py", line 296, in get
    try: self.session = Session()

So it looks like there is a temporary hiccup in Google App Engine’s Datastore (Bigtable?). In the example above the error comes from the Session model of gaeutilities, but it could be any other model.

It is possible to catch DeadlineExceededError and temporarily work around it, as shown in the App Engine documentation.
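
A minimal sketch of that idea follows. It is not the code from our application: handle_request and its three callback parameters are hypothetical, and the stand-in exception class exists only so that the example also runs outside App Engine. On App Engine the real class comes from google.appengine.runtime, as in the traceback above.

```python
try:
    from google.appengine.runtime import DeadlineExceededError
except ImportError:
    # Stand-in so the sketch runs outside App Engine (assumption, not the SDK).
    class DeadlineExceededError(BaseException):
        pass


def handle_request(create_session, render_page, render_retry_page):
    """Return (status, body); fall back to a cheap page if the deadline hits.

    All three arguments are hypothetical callables supplied by the caller.
    """
    try:
        session = create_session()  # the datastore write that may stall
    except DeadlineExceededError:
        # Very little time is left at this point, so keep the fallback trivial.
        return 503, render_retry_page()
    return 200, render_page(session)
```

This does not make the request succeed. It only replaces the crashed process and its error code 104 with a controlled error response.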

The best way to handle this situation is to adjust your monitoring software, Zabbix in our case. Zabbix allows you to configure triggers so that they don’t alarm on every bad item state change. Instead, you can use the min() function and trigger the alarm only after the trigger condition has failed every time during a monitoring period. Just make sure that the trigger period is at least twice as long as the update interval of your web scenario: this way Zabbix logs at least two item states during the period and allows one of them to be a failure.

For example, if

  • the update interval of the web scenario is 60 seconds
  • the trigger function must see a minimum failure value of 1 during 2*60 seconds + some buffer = 150 seconds

then the trigger expression is:

{xxx.fi:web.test.fail[de.mfabrik.com].min(150)}=1

This will allow one failed response before triggering the alarm.
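
To see why a single failure is tolerated, the trigger logic can be imitated in a few lines of Python. This is a simplified model, not how Zabbix is implemented: the function name should_alarm and the (timestamp, fail) sample format are made up for illustration, and fail is reduced to 0 (OK) or 1 (failed).

```python
def should_alarm(samples, now, window=150):
    """Imitate min(150)=1: alarm only if every sample in the window failed.

    samples: list of (timestamp_seconds, fail) pairs, with fail being 0 or 1.
    """
    recent = [fail for ts, fail in samples if now - window < ts <= now]
    return bool(recent) and min(recent) == 1


# One bad response between two good ones: min() over the window is 0.
one_failure = [(0, 0), (60, 1), (120, 0)]
# Every check in the window failed: min() over the window is 1.
outage = [(0, 1), (60, 1), (120, 1)]

print(should_alarm(one_failure, now=120))  # False
print(should_alarm(outage, now=120))       # True
```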


One thought on “Google App Engine: issues with dynamic instances and DeadlineExceededErrors”

  1. Your problem with the gaeutilities session object (“try: self.session = Session()”) might be connected to the fact that creating a new Session object can in some cases trigger an additional action: removing stale session data from the server (https://github.com/joerussbowman/gaeutilities/blob/master/appengine_utilities/sessions.py#L635)
    If this seems to be the problem, you might disable this feature and clean up old sessions in a cron job (or reduce the “clean check percent”).
