Monday, January 23, 2017

Chrome 56 Will Aggressively Throttle Background Tabs

[Jan 26, 7:25am]: Google has responded to this article and the plans for throttling:
Unfortunately, our current implementation throttles WebSockets. Because of this we ARE NOT SHIPPING this intervention in M56.
The current plan is to disable time-budget background timer throttling for the pages with active connection (websocket, webrtc and server-sent events) and to ship in M57 (subject to further feedback). We will keep you updated with the progress.

----------------

"Hell is full of good wishes or desires"

Chrome 56 introduces a commendable optimization to throttle background tabs' timers. From the Intent to Implement, the gist is:


  • Each WebView has a budget (in seconds) for running timers in background.
  • A timer task is only allowed to run when the budget is non-negative.
  • After a timer has executed, its run time is subtracted from the budget.
  • The budget regenerates with time (at rate of 0.01 seconds per second).
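
To get a feel for what these four rules imply, here is a toy simulation. It is only a sketch of the rules as listed above, not Chrome's implementation: the function name, the option names, and the assumption that the timer is re-armed one interval after each firing are all mine.

```javascript
// Toy model of the time-budget rules listed above (not Chrome's actual code).
// All times are in seconds. Returns how late each timer firing was.
function simulateBudget(options) {
  var regenRate = 0.01;               // budget regained per wall-clock second
  var budget = options.initialBudget; // current budget
  var now = 0;                        // wall-clock time
  var nextDue = 0;                    // when the timer would like to fire
  var delays = [];

  for (var i = 0; i < options.ticks; i++) {
    // Idle until the timer is due; the budget regenerates meanwhile.
    if (now < nextDue) {
      budget += (nextDue - now) * regenRate;
      now = nextDue;
    }
    // A task may only run when the budget is non-negative, so wait it out.
    if (budget < 0) {
      now += -budget / regenRate;
      budget = 0;
    }
    delays.push(now - nextDue);

    // Run the task: its run time is subtracted from the budget.
    var firedAt = now;
    now += options.taskCost;
    budget += options.taskCost * regenRate - options.taskCost;

    // Assumption: the timer is re-armed one interval after it fired.
    nextDue = firedAt + options.interval;
  }
  return delays;
}

// A 1-second timer whose callback burns 50ms of CPU each time it runs.
var delays = simulateBudget({ interval: 1, taskCost: 0.05, initialBudget: 0, ticks: 3 });
```

Under these assumptions, a callback that costs 50ms can only run about once every 5 seconds (`taskCost / regenRate`), while one that costs less than 10ms per second is never delayed.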

This is generally a Good Thing. Browser vendors should be concerned about battery life, and this will do a lot to help. Unfortunately, this implementation is ignoring the new reality: the browser is no longer just a reading device; it is the world's largest application platform.

This will break apps on the web.

When idle, your application's timers may be delayed for minutes. Aside from terrible hacks like playing zero-volume sounds, you have no control over this. Avoiding expensive CPU work is not a panacea; some applications must do significant work in the background, including syncing data, reading delta streams, and massaging said data to determine whether or not to alert the user.

Worse yet, the heuristic is based on local CPU time; faster clients may have no issue and face no throttling, but slower devices often will, causing cascading processing and notification delays.

Popular applications like Slack and Discord, as well as our own application (BitMEX, a Bitcoin trading site), will be hugely and adversely affected by this. What good is the user granting the Notification permission if we can't do the processing necessary to even fire the notifications?

Before you think this sounds alarmist, here's real data from a few days ago, where I ran a simple `setInterval` every second while our application ran and recorded the actual time elapsed. This shows timer delays on a site that does a medium amount of background processing:


Gone are the days when you could count on timers to fire semi-reliably.
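
If you want to take the same kind of measurement on your own site, something like the following would do. This is a sketch, not the exact script I ran; `createDriftRecorder` and its injectable `now` argument are names made up for this example.

```javascript
// Records how late each tick of a nominally fixed-interval timer fires.
// `now` is injectable so the recorder can be exercised without real timers.
function createDriftRecorder(intervalMs, now) {
  now = now || Date.now;
  var last = now();
  var samples = [];
  return {
    samples: samples,
    tick: function () {
      var current = now();
      var elapsed = current - last;
      last = current;
      // `late` is how far past the nominal interval this tick fired.
      samples.push({ elapsed: elapsed, late: elapsed - intervalMs });
    }
  };
}

// In a page:
//   var recorder = createDriftRecorder(1000);
//   setInterval(recorder.tick, 1000);
//   // later, with the tab in the background for a while:
//   console.table(recorder.samples);
```

In a foreground tab `late` should hover around zero; in a throttled background tab it is the delay introduced by the browser.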

This has dire ramifications for sites that keep WebSocket subscriptions open. In fact, we've already begun work with Primus to remove client-side heartbeat timers, a change that will make it into a new semver-major release soon. Emulation libraries like socket.io may have more trouble.

The recommendation from the Chrome team is to move this work to Service Workers. While it's great to have a way out, this recommendation involves significant compromises, development work, and compatibility fallbacks. And who's to say that Chrome won't have to start throttling noisy Service Workers in the future?

If you run an application that counts on background timers firing reliably, please leave a respectful comment explaining your use case in the Intent to Ship thread before this rolls to Stable. The team is listening and has already acknowledged their intent to make this throttling less aggressive. More data points would be very useful to them to ensure Chrome 56 does not break the web.

Tuesday, May 26, 2015

Using M/Monit Safely with AWS

In my main role, I am CTO of BitMEX.com, an up-and-coming Bitcoin derivatives exchange. We run on AWS, and we have a number of isolated hosts and a very restrictive and partitioned network.

At BitMEX, we are wary of using any monitoring platforms that could cause us to lose control. This means staying away from as much closed-source software as possible, and only using tried-and-true tools.

One of our favorite system monitoring tools is the aptly-named Monit. Monit is great at keeping on top of your processes in more advanced ways than simple pid-checking; it can check file permissions, send requests to the monitored process on a port (even over SSL), and check responses. It can do full-system monitoring (cpu, disk, memory) and has an easy-to-use config format for all of it.

Sounds good, right? Well, as nice as it is, we need to get notifications when a system fails - and fast. We configured all of our servers to send mail to a reserved email address that would follow certain rules, post a support ticket to our usual system, and ping Slack. But emails on EC2 are notoriously slow because EC2 could so easily be used for spam.

We found our postfix queue getting 30 minutes deep or more because of AWS's throttling. We use external monitoring tools so it's not always a problem; we generally find out about an outage within about 30 seconds. But if there is a more subtle problem (like a server running out of inodes or RAM) and it takes 30 minutes to get the alert, there's a problem.

I started reconfiguring our monit instances to simply curl a webhook so we'd get Slack notifications right away. That's good, but any change in the hook and I'd need to log into every single server and reconfigure. Besides, I could really use a full-system dashboard. What to do?


M/Monit solves this problem. It provides a simple (yeah, not the prettiest, but functional) dashboard for all hosts, detailed analytics, and configurable alerts. Now, I can just configure all of my hosts to alert on various criteria, and configure M/Monit's Alert Rules to take care of figuring out how severe it is, who should be notified, and how.

M/Monit supports actually starting, stopping, and restarting services via the dashboard, but that requires M/Monit to have the capability to connect to each server's Monit (httpd) server. I didn't want this; setting it up correctly is a pain (have fun generating separate credentials per-server for SSL) and it's a potential DoS vector. Thankfully, M/Monit runs just fine if you don't open the port.

Setting up M/Monit is pretty easy and I don't think I need to go through it; I recommend using SSL though, even inside your private network.

Here's how I set it up on each server (assuming existing monit configs):

set mmonit https://<user>:<pass>@<host>:<port>/collector 
  and register without credentials

That's it. Just reload the monit service and it'll start uploading data.

What's the credentials bit? Well, Monit supports automatically creating credentials at first start that M/Monit can use for controlling processes. But I didn't want to support that anyway, so I added this line to disable it. It's not a big deal if you omit it, and the control port is left closed; M/Monit just won't be able to connect, but it will continue monitoring properly.

That's it! I then used the script above as an alert mechanism inside Monit (Admin -> Alerts) and it happily started sending instant notifications to Slack. You can configure just about any type of webhook, mail notification, whatever.

It took surprisingly little time to get this right, and the team behind Monit deserves a lot of credit for this.


Monday, July 21, 2014

Managing & Massaging Data with ReactJS

I've been working on a new project.  It's heavily data-oriented, and data is changing constantly. I believe it would be very difficult to make a project like this work in a performant manner even 3-4 years ago; it's nearly the perfect use case for React, in my opinion.

I have about 8 data stores, and each client is processing 2-3 websocket messages *per second*, updating those stores. Each store update triggers a render that may be an insert, modify, delete, or complete replacement of a store. Each one of these stores is linked to one or more widgets that must update immediately so that users are informed of the most up-to-date state of the system.

React is a great fit for this because I can modify the data, pipe the proper `props` hooks through the system, and call it a day. But React makes no assumptions about your data, and is completely hands-off about how you should manage it. To help out, I use Fluxxor with some modifications to manage my data stores. But even Flux/Fluxxor does not tell you how to manage your data. So I set about working out how best to store my data in the browser.

It appears that the "React Way" is to pass only raw data around to components. This has some distinct advantages, to be sure. Data is much easier to reason about when there are no wrappers getting in the way. However, `shouldComponentUpdate`, the lifecycle event that allows you to skip a rerender in the case of an insignificant data change, becomes a serious challenge in the case of raw JS data. Javascript's arrays and objects are mutable, which is the norm in most languages but becomes a serious hassle in the context of React. In order to determine if data has changed, you may have to do a deep comparison of all arrays or objects passed to your component, which can take almost as long as rebuilding the component (as virtual DOM diffing is quite fast).
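
To make the cost concrete, here is roughly what a deep-comparison check looks like. This is a minimal sketch for plain arrays and objects; `deepEqual` and `shouldUpdateDeep` are illustrative helpers written for this post, not part of React.

```javascript
// Recursively compares plain arrays/objects. Every call walks the whole
// structure, so the cost grows with the amount of data passed as props.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  var keysA = Object.keys(a);
  var keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(function (key) {
    return Object.prototype.hasOwnProperty.call(b, key) && deepEqual(a[key], b[key]);
  });
}

// What a shouldComponentUpdate body would have to do with raw, mutable data
// (the `rows` prop name is hypothetical).
function shouldUpdateDeep(prevProps, nextProps) {
  return !deepEqual(prevProps.rows, nextProps.rows);
}
```

With a few hundred rich rows arriving several times per second, this walk runs on every update, whether or not anything changed.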

I'm building an app that has real requirements, and eventually it becomes quite important to massage data. That means adding labels, changing column names for readability, adding derived/virtual properties that depend on other properties (and update properly when their dependencies change), and so on. I thought about this and got a flashback to Backbone - Backbone.Model is one of the best parts of Backbone. Maybe I could just use it raw?

I started working with Backbone as my Model/Collection abstraction, but it didn't offer as much as I wanted, had a lot of cruft I didn't need (Router, Views, History, etc.), and it wasn't easy to update if I removed that cruft. Just about that time, a user on HN mentioned ampersandJS, a refactored and enhanced version of Backbone's data components. It's much better, and if you're willing to leave < ES5 behind, it does quite well with data getters, setters, deep model hierarchies, derived properties, session storage, and more.

Now, I like this, but a lot of it assumes that you want mutable data structures. I don't. So I set upon removing mutability from my collections:



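// Collection.js, superclass for all collections
// The requires below were omitted from the original listing; package names are assumed.
var AmpersandCollection = require('ampersand-collection');
var underscoreMixin = require('ampersand-collection-underscore-mixin');

// We always want to mix in underscore & a constructor override.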
module.exports = function() {
  var args = [];

  // Remove mutation methods
  var constructor = AmpersandCollection.prototype.constructor;
  args[0] = {
    constructor: function(models, options) {

      // Call super.
      constructor.call(this, models, options);

      // Freeze this collection
      var me = this;
      ['add', 'set', 'remove', 'reset'].forEach(function(funcName){
        me[funcName] = doNotUse.bind(null, funcName);
      });

    }
  };

  // Add underscore
  args[1] = underscoreMixin;

  // Add collection definition
  for (var i = 0; i < arguments.length; i++) {
    args.push(arguments[i]);
  }
  return AmpersandCollection.extend.apply(AmpersandCollection, args);
};

function doNotUse(name) {
  throw new Error("Collections are immutable, do not use the method: " + 
    name);
}

// For instanceof checks - necessary when extending this class.
// This allows components to call `new Collection(models, options);`
module.exports.prototype = AmpersandCollection.prototype;




This allows me to create a new collection every time I make a significant data change, making `shouldComponentUpdate` O(1) while giving me all the benefits that these Collections and Models provide: validation, virtual attributes, nested models, sorting, and so on.

In the end, I found that calling the Collection's constructor on every data change was far too expensive; I have some 100+ element arrays full of rich objects that often change one at a time. I added a helper:



// Assumes underscore is in scope as `_`, e.g. var _ = require('underscore');
// Lighter weight than creating a new collection entirely.
AmpersandCollection.prototype.clone = function(data, options) {
  if (!options) options = {};
  // Create a new object.
  function factory(){}
  factory.prototype = this.constructor.prototype;
  var newCollection = new factory();
  _.extend(newCollection, this);

  // Assign models
  newCollection.models = _.map(data, function(datum) {
    var model =  newCollection._prepareModel(datum);
    newCollection._addReference(model);
    return model;
  });

  // Sort if necessary.
  var sortable = this.comparator && options.sort !== false;
  if (sortable) newCollection.sort();

  // Remove all references on the old data so it can be GCed.
  // This adds some runtime cost but prevents memory from getting out of control.
  this.off();
  _.each(this.models, function(model) {
    this._removeReference(model);
  }.bind(this));

  return newCollection;
};


This benchmarks quite well: I am able to replace a 150 element collection of large, rich models in less than 0.1ms.

So far, this has been working for me. It creates a fair bit of GC pressure but I am careful to only replace models themselves when they have changed as well, and to preserve those that have not. In a way, it's a lower-tech version of ClojureScript's structural sharing, which is certainly far superior to this. However, I haven't found a good FP-style replacement for what I'm doing.
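
Stripped of Ampersand entirely, the "replace only what changed" idea looks something like this. It is a simplified sketch using plain arrays and objects; `replaceItem` and `shouldUpdate` are hypothetical names, and it assumes each item has an `id` field.

```javascript
// Returns a new array in which only the changed item is a new object;
// every other element is the very same reference as before.
function replaceItem(items, id, changes) {
  return items.map(function (item) {
    if (item.id !== id) return item; // unchanged: keep the old reference
    var copy = {};
    Object.keys(item).forEach(function (key) { copy[key] = item[key]; });
    Object.keys(changes).forEach(function (key) { copy[key] = changes[key]; });
    return copy;
  });
}

// With data handled this way, the update check is one identity comparison.
function shouldUpdate(prevProps, nextProps) {
  return prevProps.items !== nextProps.items;
}

var before = [{ id: 1, price: 100 }, { id: 2, price: 200 }];
var after = replaceItem(before, 2, { price: 201 });
// after[0] is the same object as before[0]; only after[1] is new.
```

A child component rendering a single item can apply the same identity check to its own prop, so only the widget for the changed item rerenders.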

Have any of you had experience doing this in a similar way, or using Mori instead? What have you found to be the pain points and benefits of your method?

Wednesday, November 6, 2013

A simple demonstration of the benefits of minification on the Healthcare.gov Marketplace. What happened?

(This was originally a pull request on the healthcare.gov repo that was taken down - it has since been moved to its own repository)

Notes


CGI Federal has not released the source to the webapp powering Healthcare.gov. This pull request is not meant to be merged into this repository. For lack of a better place, I have put it here in hopes that it will get some eyes. This PR is directed at CGI Federal, not Development Seed, who has done some clean and responsible work on their part of the project. I hope they will allow me to occupy this space for a little while so this story can be told.

Note: I have moved this idea to its own repository in hopes of sourcing more fixes from the community. Please contribute!

What is this?


This commit is a quick demonstration of how badly CGI Federal has botched the Healthcare.gov Marketplace.

In less than two hours, with absolutely no advance knowledge of how Healthcare.gov works, I was able to build a simple system for the absolutely vital task of minifying and concatenating static application assets. CGI Federal's coding of the marketplace has so many fundamental errors, I was able to reduce the static payload size by 71% (2.5MB to 713KB) and reduce the number of requests from 79 to 17.

This means 62 fewer round trips, 71% fewer bytes on the wire, and a site that loads much more quickly with less than a quarter of the requests - crucial during the first frantic days of launch when web servers are struggling to meet demand.

I'm not any sort of fantastic coder. Most web developers would be able to easily complete this step. It is inexcusable that CGI Federal went to production without it, given the absurd amount of taxpayer money they were given to develop this system. Most of the Javascript code that we are able to see was clearly written by inexperienced developers. If they can't even complete this simple step, we have to ask ourselves: is this the best $50+ million can buy? How can such an expensive, vital project be executed so poorly?

There are many other issues in the current system besides this one. This is merely a demonstration of the lack of care CGI Federal has put into this project. Simply put, a single programmer could have easily done this in a day and healthcare.gov would have stood a much better chance against the load this week. Clearly, there is a perverse set of incentives that has dominated the federal contracting system; delivering a quality product appears to be at the very end of their priority list.

Technical Details


The production app on healthcare.gov delivers a very large payload of JS and CSS without making any attempt to reduce load on its own servers. A great benefit could be realized by simply minifying and concatenating all source.
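
For readers who have not set up such a step before, a concat/minify build in Grunt looks roughly like the sketch below. It is not the Gruntfile from this commit: every path and target name is a placeholder, and it assumes the grunt-contrib-concat, grunt-contrib-uglify and grunt-contrib-cssmin plugins are installed.

```javascript
// Hypothetical Gruntfile sketch; all paths and target names are placeholders.
function configureBuild(grunt) {
  grunt.initConfig({
    // Join all scripts into one file so the browser makes a single request.
    concat: {
      js: { src: ['js/lib/**/*.js', 'js/app/**/*.js'], dest: 'build/app.js' }
    },
    // Minify the concatenated script.
    uglify: {
      js: { files: { 'build/app.min.js': ['build/app.js'] } }
    },
    // Combine and minify the stylesheets.
    cssmin: {
      css: { files: { 'build/app.min.css': ['css/**/*.css'] } }
    }
  });

  grunt.loadNpmTasks('grunt-contrib-concat');
  grunt.loadNpmTasks('grunt-contrib-uglify');
  grunt.loadNpmTasks('grunt-contrib-cssmin');

  // `grunt build` runs the three steps in order.
  grunt.registerTask('build', ['concat', 'uglify', 'cssmin']);
}

// In a real Gruntfile.js this function would be the module's export:
//   module.exports = configureBuild;
```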

This commit adds a simple builder and test runner and rearranges the JS directory structure a bit so it makes more sense. It also refactors some inline JS into separate files so they can also be optimized.

Adding insult to injury is the delivery of nearly 160kb of unused test data to every consumer of the app (js/dummyData.js). How this made it to the final release is beyond me.

Healthcare.gov is not setting any caching headers, so all assets need to be re-downloaded on every visit. It seems that they intended for the site to work in a completely fluid manner without reloads, but that is clearly not the case. Every refresh (and there are many throughout the process) requires reloading 80+ files, a task that can take 30s or longer and strains healthcare.gov's webservers.

To run (requires nodejs & npm):

git clone https://github.com/STRML/healthcare.gov.git
cd healthcare.gov/marketplaceApp
npm install -g grunt-cli
npm install
grunt build # concat/minification step
grunt connect # runs a webserver to view results

Load Graphs


Before (live site as of Thursday, Oct 10): [load graph image]
Note that the API call to CreateSaml2 triggers an inspector bug - the actual load time is ~28s, not 15980 days.

After (this pull request): [load graph image]
Load times are from localhost, so they are much faster than they would be otherwise. API calls fail because they are relative to the current domain.

Tuesday, November 13, 2012

WPEngine is smooth - but not that smooth

Over the last three months I've been developing a large site for WPEngine. The previous site was hosted on a simple VPS and the customer wanted a faster, more hands-off solution. I like WPEngine for a number of reasons:
  • Hands-off caching
  • Simple staging
  • Simple backups
  • Easy domain configuration
  • Fast support
And so on. Finally, I thought, a service where you get what you pay for (and you do pay for it). And the site is fast. Really fast.

But the problems have started mounting. A quick preview of what I've seen just in the past week:
  • Uploaded images occasionally disappear. I don't know where they go. The user who uploads them sees the images (they go into browser cache) but nobody else can see them. That makes this very difficult for an author to catch.
  • You simply cannot set cookies from PHP. It will not happen. It will happen on your staging server, it will happen on your local server, but WPEngine's caching simply does not allow this. This should be a big red note in WPEngine's support garage. You don't simply disable all cookies and not tell anyone.
  • WP-Cron is broken by default. Posts miss their schedule all the time. I registered a support ticket with WPEngine - they say that they have some internal defaults set that often break wp-cron, and that they would set mine back to fix it. My posts still occasionally miss their schedule. More importantly, why is wp-cron broken by default? Isn't that something important to tell your customers?
  • Weird validation - WPEngine overrides core validation routines to do totally inexplicable things like ban capital letters in usernames. What the hell?! Since WP doesn't ban this by default, you get totally unhelpful validation messages like "Your username must only have alphanumeric characters." Adding a simple filter to sanitize_user fixed this, but why do I have to do that?
  • Staging is a mess. I use Wordless, a great plugin that allows you to use HAML, comes with great helpers, and breaks functions.php into a folder of scripts. This completely doesn't work on staging, and WPEngine has no solution. On top of that, staging does not use the same caching as production, which means that even if my theme did work - I wouldn't catch many of the above bugs.
  • Import limits are agonizingly low. I wanted to import a bunch of posts of a certain category to another site on my WPMU network. Rather than pull the raw SQL, it would be nice to use the WP importer/exporter so I can pull media data, create taxonomies, and so on. After all, that's what it's built for. But WPEngine has a 256MB memory limit (!!!!!) which means that my imports have a paltry 1.6MB limit. What can I do with this? I ended up having to find a Python script to split my imports into manageable chunks. On other hosting I would simply raise the memory limit. This was a real pain.
  • Just as I was writing this post, I lost all of my restore points. Wow.


I'm sure I will find more.

A lot of this would be alleviated by having a simple guide to WPEngine's quirks - like a "What to watch out for in production" article. But no such thing exists. And until the staging environment is a proper staging environment, with the same exact caching, I will continue to find bugs in production & only in production.

Suffice it to say, WPEngine is anything but "Hassle-Free Wordpress Hosting."

Edit: Since this post, I've come across yet another: WPEngine's aggressive site caching actually caches the blogname of my main site onto several of my child sites every morning. That's right, the title of my child sites actually CHANGES every morning and I have to empty the cache daily to set it back. No database writes are done, nothing is wrong in MySQL. This is WPEngine caching my blogname - and getting it wrong. I can't find a GIF to explain how I feel.

Tuesday, October 30, 2012

Uploading a whole directory to a remote server with LFTP

As any user of a restricted VPS or PaaS product (like WPEngine) might know, there are sometimes restrictions that stop you from using ssh or scp.

In my most recent case I really needed to move about 30GB of tiny thumbnail files to WPEngine. I considered downloading them all to my machine with Cyberduck, then uploading them over, but that would have been way too slow. Cyberduck goes into a 'preparing...' cycle that seems to never end, and caps my CPU at 100%.

And in any case, SFTP doesn't support recursive put. What?!

Thankfully, there's LFTP. LFTP is available in just about any sane package manager and works in this fashion:

lftp sftp://user@mysub.domain.com

It uses the same syntax as sftp, plus a few awesome additions.
 
In my case, I wanted to move an entire wp-content/uploads folder over. This is really easy. Simply cd into your existing uploads directory, fire up lftp, and...

lftp user@site.wpengine.com:/wp-content/uploads/2012/10> mirror -R --parallel=20

This is the money shot. mirror -R means "mirror in reverse", that is, move all local files to remote. And the parallel directive is extremely useful when moving tons of small files. I was seeing incredible (>6MB/sec) transfer rates of these tiny files between sites. And when uploading on my small Comcast connection, I was able to saturate the pipe with parallelism this high.

Simply put, there exists no GUI tool that can pull something off this elegantly and quickly. Use it.