PHP at Scale: Efficiently Storing Object Data


This post seeks to discourage two patterns in any PHP code that may ever need to scale:

  • use of a $properties array to store the attributes of an object
  • setting dynamic properties (i.e. setting $obj->foo where “foo” is undeclared)

Both may seem like innocent, sensible additions at the time, without it being clear to developers how grave the consequences will be down the line. (Impatient readers can skip ahead to the dynamic-property benchmark near the end.)

Starting Simple

class Company {
  public $name;
  public $address;
}

This is fine for an introduction to OOP, but ignores real world needs like representing relational data. Consider that the consumer of a Company object may want to know the ids of employees at the company, and that those may need to be loaded from a database. To avoid repeated queries, it makes sense to cache that data. This leads to the addition of:

  protected $employee_ids;
  public function employeeIds() {
    if (!isset($this->employee_ids)) {
      $this->employee_ids = /* fetch from DB */;
    }
    return $this->employee_ids;
  }

An unrelated feature may then require the ability to iterate over the attributes of a company, and perhaps only those that would be stored in the “companies” table of a SQL database. That becomes more complicated when the object’s set of properties now includes both attributes of the company itself (name, address), and something derived at runtime from another dataset (employee_ids).
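To make that concrete, here is a sketch of the naive route, using a hypothetical attributes() helper (it is not part of the original example): get_object_vars(), called from inside the class, also sees the cached employee_ids, so filtering it out becomes the developer's job.

class Company {
  public $name;
  public $address;
  protected $employee_ids;

  // hypothetical helper: return only the "companies" table columns
  public function attributes(): array {
    $attrs = get_object_vars($this); // name, address, employee_ids
    unset($attrs['employee_ids']);   // manual bookkeeping starts here
    return $attrs;
  }
}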

A (Non-) Solution

class Company {
  // with the expectation that $this->properties['name']
  // replaces $this->name
  public $properties;
}

This is by far the simplest solution in that it isolates attributes that need to be iterated over inside a data structure that’s already iterable.

The issue: Hash tables are memory hogs, and every instance of Company now requires a hash table.

Understanding the Impact on Memory

To oversimplify: if a class declares a fixed set of properties (three, in the illustration below), the interpreter can allocate that many pointer-sized slots for each instance of the object. It then needs only a single, shared map from property names to slot indices. Ignoring other data, such as property visibility, this can be illustrated roughly like so:

// mapping symbols to indices
{
  'foo': 0,
  'bar': 1,
  'baz': 2
}

// 3 blocks of memory for object 1
[*ptr_to_foo][*ptr_to_bar][*ptr_to_baz]

// repeat for objects 2 to n-1...

// 3 blocks of memory for object n
[*ptr_to_foo][*ptr_to_bar][*ptr_to_baz]

To find the baz property of object n, the interpreter needs only to look up that the index of baz is 2, then add 2b, where b is the size of one pointer-sized slot, to the address where that object's slots begin. Most critically, the memory overhead of the name-to-index map is incurred only once. That is, where m is the size of the map and n is the number of objects, the total allocation is m + 3nb. The map m is paid once no matter how many instances exist, and each additional instance costs only 3b, which makes scaling n minimally painful.
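To make the arithmetic concrete, here is a sketch with purely illustrative numbers; the sizes are invented to show the shape of the scaling, not PHP's actual internals.

// illustrative numbers only, not real engine sizes
$m = 1024;    // one shared name-to-index map, paid once
$b = 8;       // one pointer-sized slot
$n = 100000;  // number of objects
$total = $m + 3 * $n * $b; // 2,401,024 bytes, roughly 2.3 MiB
// doubling n adds only another 3 * n * b; the map m never grows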

When an associative $properties array is introduced, PHP implements this as a hashtable for every instance. The data structure may be represented as:

// $properties of object 1
{
  'foo': *ptr_to_foo_1,
  'bar': *ptr_to_bar_1,
  'baz': *ptr_to_baz_1
}

// repeat for objects 2 to n-1...

// $properties of object n
{
  'foo': *ptr_to_foo_n,
  'bar': *ptr_to_bar_n,
  'baz': *ptr_to_baz_n
}

Each of these hash tables carries bookkeeping overhead that, in the previous layout, was paid only once. Where s is the size of one per-instance table, the overhead becomes sn. In plain language, this is the relevant comparison:

1 big thing plus n small things vs. n big things

With all of this considered, it becomes immediately obvious why $properties arrays are a terrible idea for any sufficiently large set of objects.
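For readers who would rather measure than take the theory on faith, a quick sketch along these lines (the class names are illustrative and not from the post's repository) makes the gap visible with memory_get_usage():

class DeclaredProps {
  public $foo;
  public $bar;
  public $baz;
}

class ArrayProps {
  public $properties = [];
}

$objects = [];
$before = memory_get_usage();
for ($i = 0; $i < 100000; $i++) {
  $o = new DeclaredProps();
  $o->foo = $i; $o->bar = $i; $o->baz = $i;
  $objects[] = $o;
}
printf("declared: %.2f MiB\n", (memory_get_usage() - $before) / 1048576);

$objects = [];
$before = memory_get_usage();
for ($i = 0; $i < 100000; $i++) {
  $o = new ArrayProps();
  $o->properties['foo'] = $i;
  $o->properties['bar'] = $i;
  $o->properties['baz'] = $i;
  $objects[] = $o;
}
printf("array:    %.2f MiB\n", (memory_get_usage() - $before) / 1048576);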

An Efficient Solution

The desire is to have an isolated, iterable container for a specific set of data related to the company. Another object that implements things like the Iterator interface is arguably the best way to handle this in PHP. A basic implementation is available on GitHub as part of a test setup that demonstrates just how terrible the memory tradeoff of convenience can be (more on that later). To simply illustrate the solution at a high level:

abstract class Data implements \ArrayAccess, \Iterator, \Countable {
  // refer to the GitHub link above for the body
}

class CompanyData extends Data {
  public $name;
  public $address;
}

class Company {
  // $properties is an instance of CompanyData
  public $properties;

  public function __construct() {
    $this->properties = new CompanyData();
  }
}

$c = new Company();

// works thanks to ArrayAccess interface
$c->properties['name'] = 'Widgets Unlimited';
$c->properties['address'] = '123 Main St.';

// works thanks to Iterator interface
foreach ($c->properties as $prop => $value) {}

This isolates the attributes of a given company in $properties and provides convenient iteration and access, yet it avoids the overhead of an array for each object, because each $properties is an efficiently stored instance of the CompanyData class.
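For readers who want the gist without clicking through, here is a minimal sketch of what such a Data base class could look like. It is an illustration written against PHP 8, not the implementation linked above; the fields() helper and the private cursor properties are naming choices made for this sketch.

abstract class Data implements \ArrayAccess, \Iterator, \Countable {
  // cursor state for iteration; excluded from the data fields below
  private array $iterKeys = [];
  private int $iterPos = 0;

  // the subclass's declared properties are the data fields
  private function fields(): array {
    $vars = get_object_vars($this);
    unset($vars['iterKeys'], $vars['iterPos']);
    return $vars;
  }

  // ArrayAccess
  public function offsetExists(mixed $offset): bool {
    return array_key_exists($offset, $this->fields());
  }
  public function offsetGet(mixed $offset): mixed {
    return $this->{$offset};
  }
  public function offsetSet(mixed $offset, mixed $value): void {
    // refuse undeclared names so dynamic properties can't sneak back in
    if (!property_exists($this, (string) $offset)) {
      throw new \OutOfRangeException("Undeclared property: {$offset}");
    }
    $this->{$offset} = $value;
  }
  public function offsetUnset(mixed $offset): void {
    $this->{$offset} = null;
  }

  // Countable
  public function count(): int {
    return count($this->fields());
  }

  // Iterator
  public function rewind(): void {
    $this->iterKeys = array_keys($this->fields());
    $this->iterPos = 0;
  }
  public function valid(): bool {
    return $this->iterPos < count($this->iterKeys);
  }
  public function current(): mixed {
    return $this->{$this->iterKeys[$this->iterPos]};
  }
  public function key(): mixed {
    return $this->iterKeys[$this->iterPos];
  }
  public function next(): void {
    $this->iterPos++;
  }
}

Storing the iteration cursor on the object itself keeps the sketch short; the important part is that every data field remains a declared property, so no per-instance hash table is ever created.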

Danger Ahead

PHP is overly permissive, and that permissiveness can destroy the memory advantage just described. Namely, allowing dynamic property assignment has a terrible side effect.

class Animal {
  public $type;
}

$cow = new Animal();
// PHP allows this**
$cow->color = 'brown';

** Prior to PHP 8.2, setting an undeclared property was silently allowed. As of 8.2, it emits a deprecation warning.

When PHP encounters this code, it can no longer rely on the map from properties to pointers that makes objects efficient because “color” is not in that map. Instead, a new map is created just for the $cow instance. This is effectively a return to having the overhead of a $properties array for every instance.

Fortunately, version 8.2 deprecated dynamic properties, but with no release date announced for PHP 9.0, actual removal of the feature (except for classes that opt back in with the #[AllowDynamicProperties] attribute) is still far off. For now, developers must remember that something being allowed doesn't make it a good idea.
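A quick illustration of that escape hatch, with a throwaway class name:

// PHP 8.2+: a class can opt back in to dynamic properties explicitly
#[\AllowDynamicProperties]
class LegacyBag {}

$bag = new LegacyBag();
$bag->anything = 'goes'; // no deprecation warning, but the per-instance
                         // property table (and its memory cost) is back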

By the Numbers

That’s a nice bit of theory, but how bad could it really be? Bad. Really bad.

As an illustrative example, two classes (MunicipalityData, PersonData) were created that extend Data. Each is suitable for use as a properties container as outlined earlier in the post. Whereas MunicipalityData has only three properties, PersonData has 32, to illustrate how things scale with more properties.

A benchmark script creates 100,000 objects of a particular class and records the memory usage and elapsed time, then does it again, this time setting a dynamic property on each object. It does so for both classes and reports the results.
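A compressed sketch of the benchmark's shape (not the actual script from the repository) looks roughly like this:

function measure(string $class, bool $withDynamic): array {
  $memBefore  = memory_get_usage();
  $timeBefore = microtime(true);

  $objects = [];
  for ($i = 0; $i < 100000; $i++) {
    $o = new $class();
    if ($withDynamic) {
      $o->not_declared_anywhere = $i; // forces a per-instance property table
    }
    $objects[] = $o;
  }

  return [
    'mib' => (memory_get_usage() - $memBefore) / 1048576,
    'sec' => microtime(true) - $timeBefore,
  ];
}

foreach ([MunicipalityData::class, PersonData::class] as $class) {
  foreach ([false, true] as $withDynamic) {
    $r = measure($class, $withDynamic);
    printf("%s dynamic=%s: %.2f MiB, %.3f s\n",
      $class, $withDynamic ? 'yes' : 'no', $r['mib'], $r['sec']);
  }
}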

All of the code is available on GitHub.

The output:

  Testing MunicipalityData (w/ 3 public properties)
    No Dynamic Props: 14.15 MiB, 0.018 seconds
    w/ Dynamic Props: 49.02 MiB, 0.032 seconds
    Memory Penalty: 3.5x
    Time Penalty: 1.8x
  Testing PersonData (w/ 32 public properties)
    No Dynamic Props: 65.05 MiB, 0.048 seconds
    w/ Dynamic Props: 314.52 MiB, 0.138 seconds
    Memory Penalty: 4.8x
    Time Penalty: 2.9x

“Don’t Use PHP Then”

This is a silly response. Legacy code exists, and some new projects are still written in PHP, even if usage is sliding. Engineers also rarely know how a product is going to evolve. What was once functional with a small number of objects may someday be asked to scale by orders of magnitude. Using best practices early on spares a lot of pain. This post was inspired by a real example where a piece of enterprise software was crashing after exhausting a 1 GB memory budget. RAM is a precious resource, especially in high-concurrency environments. Code accordingly.