Sunday, July 27, 2008

Large Integer Handling in Potato

Different Kinds of Squeak Objects

In Squeak, there exist different kinds of objects that differ in how their contents are addressed. An object can, for example, hold references to other objects that are addressed by names that are mapped to slot indices. Most "ordinary" objects are shaped this way. An object can also hold elements (e.g., machine words, integers, or bytes) that are addressed by slot index. Strings and arrays are an example of this: each character or array element is stored at a given position. A third possibility is for an object to hold both references and arrayed elements; e.g., CompiledMethod instances store the literals used in the corresponding method as references, and the bytecodes representing the method code as an indexable byte array.

The format field in an object's header controls the nature of the object's contents. For instance, a value of 12 and greater implies that the object is a CompiledMethod. Depending on the format field's value, the virtual machine interprets the contiguous memory area representing the object in different ways.

In JSqueak, and, consequently, in Potato, it is not straightforwardly possible to simply interpret a collection of slots in such strongly different ways. To the end of supporting all kinds of Squeak objects nevertheless, any Potato object holds two references that are used to store different kinds of data:

class SqueakObject {
...
public Object bits;
...
public Object[] pointers;
...
}


The pointers field is used to store pointers of all kinds. The bits field is used to store indexable binary data. Note that its type is Object: that way, it can store arrays of int values (e.g., for machine-word arrays) and arrays of byte values (e.g., for strings) equally well.

Representing Large Numbers

Squeak features two classes for representing integers that do not fit the 30+1-bit SmallInteger space: LargePositiveInteger and LargeNegativeInteger. Both share the property that they store, in the bits field, a byte array whose elements represent the absolute value of the corresponding number - the signum of the number is indeed determined by the object's type. A Large****tiveInteger's byte array can have any length that is required to represent the number in question. The first array element is the least significant byte of the number's binary representation.

The original JSqueak did not feature primitive implementations for arithmetic operations on Large****tiveIntegers. Instead, in-image functionality would be executed for each such operation. As each and every single bytecode of such in-image functionality has to be evaluated by the interpreter, the lack of primitives for Large****tiveInteger arithmetics most certainly implies a slowdown. This seems unnecessary given that Java, the platform on which JSqueak and Potato are implemented, has strong support for large integers in the form of the java.math.BigInteger class.

Since Large****tiveInteger objects are of a kind that only stores indexed binary data, the pointers reference is unused in such objects. The pointers array hence looks like the natural place to cache a Java BigInteger object shadowing the Squeak Large****tiveInteger instance in question. The concept of shadows was also used in the SPy project, an implementation of the Squeak VM using the PyPy infrastructure. Potato maintains, for Large****tiveInteger objects, a pointers array with size 1, which contains exactly one object: a reference to the shadow BigInteger representing the respective Squeak object.

The large integer primitives implemented in Potato create shadows on demand, i.e., the first time they are actually required to perform an arithmetic operation, for Large****tiveInteger instances that were stored in the image. All Large****tiveIntegers created by the Potato VM are equipped with shadows right away.

Creating a BigInteger shadowing a Large****tiveInteger involves reversing the byte array stored in the Squeak object as Java's BigInteger requires a big-endian representation of the number in question. This functionality is implemented in the potato.objects.LargeInteger class.

Example: Adding Large Integers

Here is how a typical implementation of a large integer primitive looks. The example represents the primitive for adding two large integers; most large integer primitives look very similar.

boolean primitiveLargeIntegerAdd() {
return pop2andPushPossiblyCoercedBigIfOK(
getBig((SqueakObject)stack.stackValue(1)).add(
coerceToBig(stack.stackValue(0))));
}


The two objects in question can be found on the stack. The object at index 1 is self, which is guaranteed to be a Large****tiveInteger. The object at index 0 is the second (right-hand) parameter to the operation; it may be something entirely different, e.g., a SmallInteger, or even an illegal object.

The getBig() method interprets its parameter as a Large****tiveInteger and returns its Java BigInteger representation. This method has a side effect: if the Squeak object does not already hold a cached BigInteger shadow, it is created and stored in the Squeak object. This is arguably not very good style and may be changed in the future.

The coerceToBig() method checks whether the passed object is a SmallInteger and, if so, returns a BigInteger representation thereof. Otherwise, it falls back to getBig().

Having obtained two BigInteger instances, the two are added using java.math.BigInteger's add() method. The result is passed to pop2andPushPossiblyCoercedBigIfOK(), which checks whether the result is actually a Large****tiveInteger or whether it can be coerced to SmallInteger instead. Eventually, the two parameters are popped off the stack, and the correct result object is pushed onto it. If it is a Large****tiveInteger, a shadow BigInteger is created right away and cached in it.

More on Shadows

The shadow concept requires some care. While most primitives yield new Large****tiveInteger objects, two of them deserve special attention. These primitives are #digitAt:put: and #replaceFrom:to:with:startingAt: - they both directly access the byte array stored in the bits field of Large****tiveInteger instances. In case such a primitive is executed, the shadow BigInteger must be updated accordingly, otherwise the two representations - the Squeak one in the byte array and the Java one in the BigInteger stored in the pointers array - are out of sync, which implies all kinds of trouble for the arithmetic primitives that rely on the cached BigInteger.

To overcome this issue, the implementations of said two primitives check whether the object on which they operate is a Large****tiveInteger and update the shadow BigInteger if necessary.

This important issue was pointed out by Carl Friedrich Bolz.

Results

Some superficial benchmarks were implemented to assess the performance of two pieces of functionality: the creation of successive LargePositiveIntegers, and adding up a large number of such objects.

The first benchmark creates 1,000,000 LargePositiveIntegers by successively adding 1. It exhibits an almost ninefold speedup over Potato without the LargeInteger primitives.

The second benchmark fills an array with 500 elements with the sums of these elements. The first two slots contain the two smallest possible LargePositiveIntegers; slot 3, the sum of the first two slots; slot 4, the sum of the first three slots, and so forth. This involves huge amounts of large integer primitives being executed (albeit only addition). This benchmark exhibits a roughly 22-fold speedup.

While these results were obtained using rather simple and straightforward benchmarks that are likely not sustainable, they still show that using the capabilities of the underlying platform in Potato is worthwhile.

Besides, it was fun implementing them.

Status

As of this writing, the large integer support can be found in the LargeInteger branch of the Potato SVN repository at SourceForge. It will be merged into the trunk once it will have undergone some more testing, benchmarking, and refactoring.

Edit: Shadows were not originally used in PyPy, but in SPy. Thanks to Adrian Kuhn for pointing this out.

No comments: