Marshaling

From RubySpec

Ruby's runtime supports the notion of marshaling. Objects can be serialized into a string, which can then be used to reconstitute the objects at a later time. The Marshal module is responsible for handling this transformation.

The Marshal File Format

As of Ruby 1.8.4, the Marshal format is itself at version 4.8 and is completely implemented in Ruby's marshal.c.

Preamble

Each stream has a two-byte preamble which specifies the major and minor versions of the stream.

As of Ruby 1.8, the preamble reads "\004\010" (0x0408 in hex).
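A quick way to confirm the preamble is to dump any object and look at the first two bytes. This is a minimal sketch; the variable names are illustrative, and it assumes an interpreter whose Marshal module exposes the MAJOR_VERSION and MINOR_VERSION constants.

```ruby
# Dump any object and inspect the first two bytes of the stream.
stream = Marshal.dump(nil)
major, minor = stream.bytes.first(2)

puts format("stream version: %d.%d", major, minor)

# Compare against the version the running interpreter reports.
same = [major, minor] == [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION]
puts "matches Marshal constants: #{same}"
```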

Type Codes

Before the contents of the type are serialized, a single byte is written denoting the type. Marshal has type codes corresponding to each of the various internal structs found in ruby.h, as well as some optimization codes (such as the symlinks described below).

Most type codes are quite readable and correspond to Ruby's syntax for the type. For example, the type code for an Array is an open square bracket ([) and the type code for a Symbol is a colon (:).

For a full list of type codes, see the Table of Type Codes at the end of this document.
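The type code is the byte immediately after the two-byte preamble, so it can be read straight out of a dump. The following sketch prints the code for a few sample objects; the sample set and variable names are chosen only for illustration.

```ruby
# Map a few sample objects to the type code byte that follows the preamble.
samples = {
  "nil"    => nil,
  "true"   => true,
  "false"  => false,
  "Fixnum" => 42,
  "Symbol" => :sym,
  "Array"  => [],
  "Hash"   => {}
}

type_codes = {}
samples.each do |label, obj|
  # Index 2 is the first byte after the preamble.
  type_codes[label] = Marshal.dump(obj)[2]
end

type_codes.each { |label, code| puts "#{label}: #{code.inspect}" }
```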

Custom Types

Classes that do not have a type code and that do not define custom marshaling methods marshal out using the "o" or "C" type codes. Classes that extend from Object use "o", whereas classes that extend from other builtin types that have their own type codes use "C". The format for both is as follows:

<o or C><symbol or link to class name><object data>

The object data normally just contains a list of instance variables. However in the "C" case, the first part of the object data will be the native data for the original builtin type, such as "[" data if the custom type extends Array. This allows the original native contents of the object to marshal along with additional custom data added in the form of instance variables, and is the reason for a separate "C" identifier for such types.
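As a sketch of the "o" case, the following dumps an instance of a plain class and checks the pieces described above. The Point class is hypothetical and exists only for this illustration; integer instance variables are used so that the dump contains nothing but the class name, the variable names and packed integers.

```ruby
# Hypothetical class used only for illustration.
class Point
  def initialize(x, y)
    @x = x
    @y = y
  end
end

dumped = Marshal.dump(Point.new(1, 2))
p dumped

# The byte after the preamble is the "o" type code,
# followed by the class name stored as a symbol.
type_code = dumped[2]
has_class_name = dumped.include?("Point")

# Loading restores the instance variables without calling initialize.
restored = Marshal.load(dumped)
ivars = restored.instance_variables.map { |name| [name, restored.instance_variable_get(name)] }
p ivars
```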

Objects that have singleton methods normally can't be marshaled; attempting to dump one raises a TypeError stating "singleton can't be dumped". However, if you specify user-marshal methods, then the resulting output will use the original type of the singleton. The singleton type will not be present when the data is loaded back into Ruby, however.
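The singleton restriction can be observed directly. In this sketch, obj and greet are illustrative names; the only assumption is that defining a method on a single object gives it a singleton class that Marshal refuses to dump.

```ruby
obj = Object.new

# Defining a method on this one object gives it a singleton method.
def obj.greet
  "hi"
end

error = nil
begin
  Marshal.dump(obj)
rescue TypeError => e
  error = e
  puts "TypeError: #{e.message}"
end
```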

Object Contents

Following each type code is a packed integer (described below) indicating the size or value of the object and, if needed, a dump of the object's contents. Often the contents are expressed as further type codes with their own contents.

For example, the symbol :koichi can be expressed as: ":\vkoichi".

The type code is the colon (indicating a Symbol). The packed integer length is the single byte "\v" (11), which represents the number 6 under the packing rules described below. It is followed by the object contents: the six characters koichi.

Some objects (such as nil, true and false) have no value and end at the type code. Others, such as Fixnums and Symlinks, have no separate contents because their value is carried entirely by the packed integer.
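The three parts of the :koichi example can be picked apart by hand. This is a sketch that only handles a one-byte length (the subtract-5 rule described under Integer Packing); the variable names are illustrative.

```ruby
stream = Marshal.dump(:koichi)

type_code   = stream[2]           # byte after the two-byte preamble
length_byte = stream.getbyte(3)   # packed integer, a single byte here
length      = length_byte - 5     # small positive values are stored +5
contents    = stream[4, length]   # the symbol's characters

puts "type code: #{type_code.inspect}"
puts "length:    #{length}"
puts "contents:  #{contents.inspect}"
```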

Integer Packing

A big part of the marshal format is its packing of integers, which is designed to be efficient for values from -123 through 122 (a single byte each), but can ultimately store any integer between Ruby's LONG_MIN and LONG_MAX defs. (In Marshal's code, this format is implemented by the w_long and r_long functions.)

Unpacking works by reading a single byte as a signed char. (To sign: (n ^ 128) - 128.) Using this byte, you can follow these rules:

  • Zero is itself.
  • 1 through 4 indicate a larger numeral: read n more bytes.
  • 6 through 127 represent 1 through 122, respectively. Subtract 5.
  • -1 through -4 indicate a larger numeral: read n.abs more bytes. The resulting value should be negative.
  • -6 through -128 represent -1 through -123, respectively. Add 5.

The larger numerals are packed as a series of bytes in little-endian order. For a positive value, one way to unpack the bytes would be: (nbytes + "\0\0\0").unpack('V').first. For a negative value, the missing high-order bytes are treated as "\377" rather than "\0", so the result is a two's complement number. The strings "i\005" and "i\373" (5 and -5) are unmarshaled as 0 using the same rules above as for 6 and -6, but should not be produced by marshaling code as the representation "i\0" is preferred.
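The rules above can be sketched as a pair of Ruby methods. The names pack_long and unpack_long are hypothetical (chosen to echo w_long and r_long); this is an illustration of the rules rather than a transcription of marshal.c, and it is only meant for values that Marshal writes with the "i" type code.

```ruby
# Pack an integer into the byte sequence described above.
def pack_long(n)
  return [0].pack("c") if n.zero?
  return [n + 5].pack("c") if n.between?(1, 122)
  return [n - 5].pack("c") if n.between?(-123, -1)

  # Larger values: emit low-order bytes first until only the sign is left.
  bytes = []
  x = n
  loop do
    bytes << (x & 0xff)
    x >>= 8
    break if x.zero? || x == -1
  end
  count = n > 0 ? bytes.size : -bytes.size
  [count, *bytes].pack("cC*")
end

# Unpack a byte string produced by pack_long.
def unpack_long(packed)
  c = (packed.getbyte(0) ^ 128) - 128   # reinterpret first byte as signed
  return 0 if c.zero?
  return c - 5 if c > 4
  return c + 5 if c < -4

  x = c > 0 ? 0 : -1                    # negatives start from all one bits
  c.abs.times do |i|
    x &= ~(0xff << (8 * i))             # clear the byte being filled in
    x |= packed.getbyte(i + 1) << (8 * i)
  end
  x
end

[0, 1, 122, 123, -123, -124, 65_536, -70_000].each do |n|
  puts "#{n} -> #{pack_long(n).inspect} -> #{unpack_long(pack_long(n))}"
end
```

Comparing pack_long(n) with the bytes that follow "i" in Marshal.dump(n) is a simple way to check the sketch against a real interpreter.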

Symbols and Symlinks

The Marshal format only stores each unique Ruby symbol once. So if, say, the symbol :matz gets stored, the Marshal stream will contain ":\tmatz" the first time the symbol is encountered in an object. Subsequent occurrences of the symbol are written as the symlink type code (a semicolon) followed by a packed integer giving the symbol's index.

Continuing the example, if :matz was the second symbol to appear in the dump, the symlink would read ";\006", meaning this is a symlink to the second symbol appearing in this dump.

To verify in irb:

 >> Marshal.dump([:koichi, :matz, :matz])
 => "\004\b[\b:\vkoichi:\tmatz;\006"

Code which attempts to load partial objects in which symlinks are broken will raise an ArgumentError: bad symbol.
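Both cases can be tried by feeding hand-written streams to Marshal.load. In this sketch the byte strings are written out literally; the second one is a deliberately broken stream containing a symlink with no preceding symbol. The exact wording of the error message may vary between Ruby versions, so it is printed rather than assumed.

```ruby
# A well-formed stream: the symlink ";\006" points at the second symbol.
good = "\x04\b[\b:\vkoichi:\tmatz;\x06".b
loaded = Marshal.load(good)
p loaded

# A broken stream: a symlink to index 1, but no symbol was ever stored.
broken = "\x04\b;\x06".b
error = nil
begin
  Marshal.load(broken)
rescue ArgumentError => e
  error = e
  puts "ArgumentError: #{e.message}"
end
```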

Extended Builtin Classes

When extending Object, the "o" type code is used, with the default marshaling behavior described for it in the table below.

When extending the other builtin classes that have their own type codes and marshal formats, the "C" type code is used. In these cases, after the "C" comes the class name, as in the "o" case. After the class name comes the data for the builtin parent class. Then comes instance variable data for the child class. For example, given:

class MyArray < Array
  def initialize; @foo = "hello"; end
end
p Marshal.dump(MyArray.new)

the output would be:

"\004\bIC:\fMyArray[\000\006:\t@foo\"\nhello"
 |     |||||       ||   |   || |   | | |
 |     |||||       ||   |   || |   | | string data
 |     |||||       ||   |   || |   | size of string
 |     |||||       ||   |   || |   string type code (a `"' character)
 |     |||||       ||   |   || symbol string data
 |     |||||       ||   |   |size of symbol string
 |     |||||       ||   |   symbol type code (a `:' character)
 |     |||||       ||   count of instance vars
 |     |||||       |size of the array (0 in this case)
 |     |||||       array type code (a `[' character)
 |     ||||symbol string data
 |     |||size of symbol string
 |     ||symbol type code (a `:' character)
 |     |user-defined class type code (a `C' character)
 |     instance variable specifier, to specify there are instance vars
 preamble
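On Ruby 1.9 and later, strings carry their encoding as an extra instance variable, so the string portion of the output above will look different there. The following variant is a sketch that uses an integer instance variable to sidestep that; TaggedArray and @tag are hypothetical names used only for illustration.

```ruby
# Hypothetical Array subclass with one integer instance variable.
class TaggedArray < Array
  def initialize
    super
    @tag = 7
  end
end

dumped = Marshal.dump(TaggedArray.new)
p dumped

# After the preamble: "I" (instance vars follow), then "C" (user class).
wrapper = dumped[2, 2]
puts "wrapper codes: #{wrapper.inspect}"

# Loading restores both the class and the instance variable.
restored = Marshal.load(dumped)
puts "class: #{restored.class}, @tag: #{restored.instance_variable_get(:@tag)}"
```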

Pseudocode for Marshaling

Here is some pseudocode for the marshaling process. It is an approximation and should be updated as necessary to be correct.

if object does not define custom marshal_dump method
  if object is a simple Object instance or extends Object
    marshal with `o' type code and instance vars
  else if object is a simple builtin class instance that has its own type code
    marshal with that type code according to the table below
  else if object extends a builtin class
    if object has instance vars
      output `I' type code
    end
    marshal with `C' type code, followed by actual class name
    marshal using superclass's logic and type code
    if object has instance vars
      marshal instance var count (+5, as in \006 for 1 var), followed by each var's symbol and value in turn
    end
  end
else
  call custom marshal_dump method on object to dump data
end

Table of Type Codes

Type                            Code  Contents
Nil                             '0'   None
true                            'T'   None
false                           'F'   None
Fixnum                          'i'   A packed integer containing the value.
String                          '"'   A sequence of bytes prefixed with the string length, as a packed integer.
Symbol                          ':'   A sequence of bytes prefixed with its length as a packed integer, like a String.
Symlink                         ';'   A packed integer referring to an earlier symbol. (See Symbols and Symlinks.)
Regexp                          '/'   The source of the regular expression, stored like a String, followed by a single byte of option flags.
Array                           '['   A packed integer containing the length of the array, followed by each object in the series.
Hash                            '{'   A packed integer with the hash size, followed by each key/value pair of objects.
uclass                          'C'   The name of the class, stored like a symbol, followed by the data of the builtin parent type. (See Extended Builtin Classes.)
Extended                        'e'   The name of a module that the following object has been extended with, stored like a symbol; the object itself follows.
Object                          'o'   The name of the class (stored as a symbol) and a dictionary of instance variables and their values (stored like a hash).
Data                            'd'   An object wrapping a C struct (T_DATA): the class name followed by the result of the object's _dump_data method.
_dump(), _load()                'u'   User specified dump/load semantics
marshal_dump(), marshal_load()  'U'   User specified marshal_dump/marshal_load semantics
Float                           'f'   Float value in decimal ASCII format, e.g. 0.25. Stored like a string.
Bignum                          'l'   Integer value too large for the Fixnum encoding. It contains the following information: a sign character (+ or -), the length in short integers (as a packed integer), and the sequence of short integers.
HashDef                         '}'   HashDef is deserialized just like a Hash, but with an additional object following it to be used as the Hash's default value.
Struct                          'S'   TODO
Module (OLD)                    'M'   TODO
Class                           'c'   TODO
Module                          'm'   TODO
Ivar                            'I'   Marks an object that carries instance variables: the wrapped object, then a packed count of instance variables, then each variable's symbol and value.
Link                            '@'   A packed integer referring to an object dumped earlier in the stream, so that shared objects are written only once.
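To show how several of these type codes fit together, here is a sketch of a reader for a small subset of the format: nil, true, false, Fixnum, Symbol, Symlink, Array and Hash. MiniLoader is a hypothetical class written for this illustration; it is not part of Ruby, and it deliberately ignores every other type code.

```ruby
require "stringio"

# Hypothetical minimal reader for a subset of the type codes.
class MiniLoader
  def initialize(data)
    @io = StringIO.new(data)
    @symbols = []   # symbols seen so far, for resolving symlinks
  end

  def load
    major = @io.getbyte
    minor = @io.getbyte
    raise ArgumentError, "unsupported version" unless major == 4 && minor <= 8
    read_object
  end

  private

  # Packed integer, following the rules under Integer Packing.
  def read_long
    c = (@io.getbyte ^ 128) - 128
    return 0 if c.zero?
    return c - 5 if c > 4
    return c + 5 if c < -4
    x = c > 0 ? 0 : -1
    c.abs.times do |i|
      x &= ~(0xff << (8 * i))
      x |= @io.getbyte << (8 * i)
    end
    x
  end

  def read_object
    code = @io.getbyte.chr
    case code
    when "0" then nil
    when "T" then true
    when "F" then false
    when "i" then read_long
    when ":"
      sym = @io.read(read_long).to_sym
      @symbols << sym
      sym
    when ";" then @symbols.fetch(read_long)
    when "[" then Array.new(read_long) { read_object }
    when "{"
      hash = {}
      read_long.times do
        key = read_object
        hash[key] = read_object
      end
      hash
    else
      raise ArgumentError, "unsupported type code #{code.inspect}"
    end
  end
end

data = Marshal.dump([:koichi, :matz, :matz, 1, -300, 70_000, nil, true, false, { :a => 2 }])
result = MiniLoader.new(data).load
p result
```

Strings, floats and user-defined objects are left out on purpose; each would add one more branch to read_object.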