Wednesday, August 29, 2012

Content hosting for the modern web



Our applications host a variety of web content on behalf of our users, and over the years we learned that even something as simple as serving a profile image can be surprisingly fraught with pitfalls. Today, we wanted to share some of our findings about content hosting, along with the approaches we developed to mitigate the risks.

Historically, all browsers and browser plugins were designed simply to excel at displaying several common types of web content, and to be tolerant of any mistakes made by website owners. In the days of static HTML and simple web applications, giving the owner of the domain authoritative control over how the content is displayed wasn’t of any importance.

It wasn’t until the mid-2000s that we started to notice a problem: a clever attacker could manipulate the browser into interpreting seemingly harmless images or text documents as HTML, Java, or Flash—thus gaining the ability to execute malicious scripts in the security context of the application displaying these documents (essentially, a cross-site scripting flaw). For all the increasingly sensitive web applications, this was very bad news.

During the past few years, modern browsers began to improve. For example, the browser vendors limited the amount of second-guessing performed on text documents, certain types of images, and unknown MIME types. However, there are many standards-enshrined design decisions—such as ignoring MIME information on any content loaded through <object> , <embed> , or <applet> —that are much more difficult to fix; these practices may lead to vulnerabilities similar to the GIFAR bug.

Google’s security team played an active role in investigating and remediating many content sniffing vulnerabilities during this period. In fact, many of the enforcement proposals were first prototyped in Chrome. Even still, the overall progress is slow; for every resolved problem, researchers discover a previously unknown flaw in another browser mechanism. Two recent examples are the Byte Order Mark (BOM) vulnerability reported to us by Masato Kinugawa, or the MHTML attacks that we have seen happening in the wild.

For a while, we focused on content sanitization as a possible workaround - but in many cases, we found it to be insufficient. For example, Aleksandr Dobkin managed to construct a purely alphanumeric Flash applet, and in our internal work the Google security team created images that can be forced to include a particular plaintext string in their body, after being scrubbed and recoded in a deterministic way.

In the end, we reacted to this raft of content hosting problems by placing some of the high-risk content in separate, isolated web origins—most commonly *.googleusercontent.com. There, the “sandboxed” files pose virtually no threat to the applications themselves, or to google.com authentication cookies. For public content, that’s all we need: we may use random or user-specific subdomains, depending on the degree of isolation required between unrelated documents, but otherwise the solution just works.

The situation gets more interesting for non-public documents, however. Copying users’ normal authentication cookies to the “sandbox” domain would defeat the purpose. The natural alternative is to move the secret token used to confer access rights from the Cookie header to a value embedded in the URL, and make the token unique to every document instead of keeping it global.

While this solution eliminates many of the significant design flaws associated with HTTP cookies, it trades one imperfect authentication mechanism for another. In particular, it’s important to note there are more ways to accidentally leak a capability-bearing URL than there are to accidentally leak cookies; the most notable risk is disclosure through the Referer header for any document format capable of including external subresources or of linking to external sites.

In our applications, we take a risk-based approach. Generally speaking, we tend to use three strategies:
  • In higher risk situations (e.g. documents with elevated risk of URL disclosure), we may couple the URL token scheme with short-lived, document-specific cookies issued for specific subdomains of googleusercontent.com. This mechanism, known within Google as FileComp, relies on a range of attack mitigation strategies that are too disruptive for Google applications at large, but work well in this highly constrained use case.
  • In cases where the risk of leaks is limited but responsive access controls are preferable (e.g., embedded images), we may issue URLs bound to a specific user, or ones that expire quickly.
  • In low-risk scenarios, where usability requirements necessitate a more balanced approach, we may opt for globally valid, longer-lived URLs.
Of course, the research into the security of web browsers continues, and the landscape of web applications is evolving rapidly. We are constantly tweaking our solutions to protect Google users even better, and even the solutions described here may change. Our commitment to making the Internet a safer place, however, will never waver.