How To Store Raw HTML in MySQL

So for whatever reason you need to store raw HTML in a MySQL database. Here's the TL;DR on how we do it over on the Media Oversight Project, where we chew through several thousand HTML core samples daily.

  1. Choose a text-based data type based on your length requirements. If you're storing raw HTML as found in the wild, MEDIUMTEXT or LONGTEXT is the safest option (this is what WordPress uses, for example). For controlled scenarios where you know the document length, you may be able to get away with VARCHAR(n).

  2. Set the column collation to utf8mb4 to avoid potential encoding issues around multibyte characters, like emoji, that frequently appear in HTML documents.

  3. Inform your MySQL library that you're using true UTF-8, i.e., utf8mb4. How to do this, and whether it's necessary, will vary from environment to environment.

  4. Make sure your code actually uses UTF-8. If your application creates HTML documents, create them in UTF-8 without BOM. If it consumes HTML documents, check the encoding and convert if necessary.

  5. Make sure to properly escape your SQL especially when inserting raw HTML. Use a MySQL-provided or MySQL-aware escape routine or, better yet, an ORM that handles escaping for you.

The only non-intuitive bit is that if you want full UTF-8 support in MySQL, you should strongly consider using utf8mb4 in lieu of utf8.

Why UTF-8? Why utf8mb4?

UTF-8 makes a lot of sense for HTML storage because because it's optimized for the kinds of text that actually appear on most Western websites while being expressive and powerful enough to represent any other language or alphabet for which a Unicode representation exists. The first line of the Wikipedia entry for UTF-8:

UTF-8 is a character encoding capable of encoding all possible Unicode code points.

So the thing is, most HTML documents are already UTF-8 encoded (under HTML 5, UTF-8 is the default W3C-recommended encoding) and documents in other encodings can be losslessly stored as UTF-8.

The only wrinkle is that MySQL's nominal utf8 collation can't store all the legal symbols that can appear in a normal UTF-encoded HTML document. Mathias Bynens explains:

Turns out MySQL’s utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported.

Luckily we can use MySQL's utf8mb4 for proper UTF-8 handling in all cases (including for stuff like emoji).

What about BLOBs?

In MySQL the BLOB and TEXT types are closely related in that both are technically string types. BLOBs have the binary collation and charset while TEXT columns have a non-binary charset like utf8 (or utf8mb4 if you're following this guide). So it's possible to store HTML using a BLOB. But in practice there are some good reasons why you shouldn't.

  • It's difficult for external tools (and people) to make assumptions about BLOB columns. External tools have no way of knowing if that BLOB is actually good old plain text or the innards of some binary JPEG. When you use TEXT instead, they know.

  • BLOBs are usually treated as "opaque" entities in database tools like MySQL Workbench, making them a smidge more difficult to work with, query against, and the like.

  • On the coding side, MySQL database libraries, wrappers, and ORMs will happily convert strings to VARCHAR or MEDIUMTEXT as necessary but will often expect you to pass a binary array for a BLOB, or some other BLOB-friendly coding structure. Depending on your language and its string facilities, this will likely require additional massaging of some sort.

The lesson here is: don't be too clever. If it looks like text, acts like text, walks like text, and gets handed in code to you as text, store it as text in the database. Just think precisely about encoding and test with a wide range of HTML document specimens.