ARC Data Format Specification
Format Version: 1.0
Overview
The ARC Data format is used in ARC for serialization of all data files saved in the Data directory (except for scripts). The file extension for ARC Data files is usually ".arc". This format is used because it is simpler than Ruby's Marshal format and uses basically no Ruby-specific data types, which makes it easier to port to a different programming or scripting language.
How to use it
In most scenarios you are not required to use the format directly, and in almost no scenario are you required to know its implementation details. In ARC, the ARC::Data module provides the module functions ARC::Data#load(filename) and ARC::Data#dump(filename, object). Objects can be serialized one by one or in an array/hash. It is recommended that you use arrays/hashes for data, as this makes extracting the data just as simple.
Problem regarding shared data and serialization
Each time data is loaded or dumped, a header definition and object ID mappings are created anew. Keep in mind that this can create problems if you dump objects separately. Ideally you should dump everything at once. The ARC Data format has this problem as well, but ARC does not allow serialization into an IO stream, so ARC itself is not affected.
Here is an example of the problem if you could serialize several ARC objects into any kind of IO stream:
Let's assume you have class A and class B with the following declarations:
class B
  attr_accessor :text
  def initialize
    @text = "zero"
  end
end
class A
  # accessor rather than writer, because the examples also read A#var
  attr_accessor :var
  def initialize
    @var = B.new
  end
end
As you can see, an instance of class A contains an instance of class B.
Now let's consider the following code:
a1 = A.new
a1.var.text = "one"
a2 = A.new
a2.var.text = "two"
a1 and a2 will have different instances of class B. Now let's assume we use this code to dump some data (generic dumping):
print a1.var.text # prints "one"
print a2.var.text # prints "two"
ARC::Data.dump(file, a1) # "file" has been opened somewhere previously for writing
ARC::Data.dump(file, a2)
And now we use this code to load the data (generic loading):
a3 = ARC::Data.load(file) # "file" has been opened somewhere previously for reading
a4 = ARC::Data.load(file)
print a3.var.text # prints "one"
print a4.var.text # prints "two"
a3.var.text = "three"
a4.var.text = "four"
print a3.var.text # prints "three"
print a4.var.text # prints "four"
This works fine because the two A#var values are different instances of class B.
Now let's take a look at what happens if we have objects that reference shared data.
b = B.new
a1.var = b
a2.var = b
b.text = "awesome"
print a1.var.text # prints "awesome"
print a2.var.text # prints "awesome" as well because it's the same object
After dumping and loading (refer to the code marked as generic dumping and generic loading), we execute this code:
print a3.var.text # prints "awesome"
print a4.var.text # prints "awesome"
a3.var.text = "not so awesome"
print a3.var.text # prints "not so awesome"
print a4.var.text # prints "awesome"
As you can see, the shared object of class B is not shared anymore and instead the A#var instances have been loaded as separate objects. This is because serialization of separate objects is treated as separate data. If this method was used for dumping:
array = [a1, a2]
ARC::Data.dump(file, array) # "file" has been opened somewhere previously for writing
And this method for loading:
array = ARC::Data.load(file) # "file" has been opened somewhere previously for reading
a3 = array[0]
a4 = array[1]
The shared data would have been preserved as shared data.
print a3.var.text # prints "awesome"
print a4.var.text # prints "awesome"
a3.var.text = "glorious"
print a3.var.text # prints "glorious"
print a4.var.text # prints "glorious"
This problem occurs with Ruby's Marshal format as well and basically with every serialization format that there is. (RMXP solved the problem of the actors in the party by having a Game_Party#refresh method that updates the array of actors from the Game_Actors class.)
Again, the ARC Data format has this problem as well, but our implementation avoids it because our interface only lets you serialize one object at a time. This currently also means that, within ARC, the format can only be written to files. This may change in the future.
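The loss of sharing described in this section can be reproduced with Ruby's built-in Marshal, which shows the same behaviour. The following is only a minimal sketch: SharedB and HolderA are hypothetical stand-ins for classes B and A above, and the in-memory round trip stands in for writing to and reading from a file.

```ruby
# Hypothetical stand-ins for classes B and A from the example above.
class SharedB
  attr_accessor :text
  def initialize
    @text = "zero"
  end
end

class HolderA
  attr_accessor :var
  def initialize
    @var = SharedB.new
  end
end

# Round-trips two holders that reference the same SharedB instance and
# reports whether the loaded copies still share one object.
def sharing_after_roundtrip
  b = SharedB.new
  a1 = HolderA.new
  a2 = HolderA.new
  a1.var = b
  a2.var = b

  # Dumped separately: each dump builds its own object mapping.
  s1 = Marshal.load(Marshal.dump(a1))
  s2 = Marshal.load(Marshal.dump(a2))

  # Dumped together in one array: both references use one object mapping.
  t1, t2 = Marshal.load(Marshal.dump([a1, a2]))

  { separate: s1.var.equal?(s2.var), together: t1.var.equal?(t2.var) }
end

result = sharing_after_roundtrip
puts "shared after separate dumps: #{result[:separate]}"
puts "shared after a single dump:  #{result[:together]}"
```

Only the single dump keeps the shared instance, which is why dumping everything at once in an array or hash is recommended.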
File Header
The ARC Data format header is fairly simple as it does not require any complex metadata.
- 4 bytes: the characters "ARCD"
- 2 bytes: format version in little endian byte order as unsigned chars to indicate a version (e.g. "1.5" would be the characters 0x1 and 0x5)
- 1 serialized object: the root object that has been serialized
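The header layout above can be sketched in Ruby as follows. The helper names arc_header and parse_arc_header are hypothetical and not part of ARC; the sketch assumes the first version byte is the major and the second the minor number, as in the "1.5" example.

```ruby
ARC_MAGIC = "ARCD".b

# Builds the 6-byte header: the magic characters followed by two
# unsigned version bytes (pack directive 'C').
def arc_header(major, minor)
  ARC_MAGIC + [major, minor].pack("CC")
end

# Returns [major, minor], or raises if the magic characters do not match.
def parse_arc_header(bytes)
  magic = bytes.byteslice(0, 4)
  raise ArgumentError, "not an ARC Data file" unless magic == ARC_MAGIC
  bytes.byteslice(4, 2).unpack("CC")
end

header = arc_header(1, 0)            # format version 1.0
major, minor = parse_arc_header(header)
puts "version #{major}.#{minor}, header size #{header.bytesize} bytes"
```

The serialized root object would follow directly after these 6 bytes.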
Object Data
Each serialized "object" has a type identifier and the actual data itself. The serialization begins at the root object and goes recursively into depth as all objects are mapped by identity (array, hash, object), equality (string) or simply dumped as raw values.
- 1 byte: type identifier
- X bytes: data
The type identifiers are as follows:
- none-type: 0x10
- false-type: 0x11
- true-type: 0x12
- 32-bit int: 0x21
- big int: 0x22
- float: 0x23
- string: 0x30
- array: 0x40
- hash: 0x41
- object: 0x00
Depending on the type, data can be of different sizes. The substructure of data is determined by the type and contains information about the size of the data itself.
none-type
The data is 0 bytes.
false-type
The data is 0 bytes.
true-type
The data is 0 bytes.
32-bit int
- 4 bytes: plain 32-bit int stored in little endian byte order (Ruby's pack format 'V')
big int
- 4 bytes: plain 32-bit int stored in little endian byte order (Ruby's pack format 'V')
float
- 4 bytes: single-precision floating point stored in little-endian byte order (Ruby's pack format 'e')
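The numeric encodings can be tried out with Ruby's pack and unpack1. This is a sketch with hypothetical helper names. Note that 'V' reads a value back as unsigned; reading it as a signed integer with 'l<' is an assumption made here, since the text does not state how negative values are treated.

```ruby
# 32-bit int: 'V' writes 4 bytes, least significant byte first.
def pack_int32(value)
  [value].pack("V")
end

# Assumption: values are read back as signed ('l<' is signed 32-bit,
# little endian). With 'V' the same bytes would be read as unsigned.
def unpack_int32(bytes)
  bytes.unpack1("l<")
end

# float: 'e' writes a single-precision float in little-endian byte order.
def pack_float32(value)
  [value].pack("e")
end

def unpack_float32(bytes)
  bytes.unpack1("e")
end

puts pack_int32(1).bytes.inspect            # least significant byte first
puts unpack_int32(pack_int32(-1))           # signed reading
puts pack_int32(-1).unpack1("V")            # unsigned reading of the same bytes
puts unpack_float32(pack_float32(1.5))      # exactly representable
puts unpack_float32(pack_float32(0.1)) == 0.1 # single precision loses digits
```

The last line illustrates the single-precision limitation: a value such as 0.1 does not survive the round trip unchanged.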
string
- 4 bytes: string ID (32-bit int)
- 4 bytes*: size of the string in bytes (32-bit int)
- X bytes*: string as UTF-8 stream
While strings do possess an ID, they are all separate objects. The ID is only used to reduce the file size by storing reoccurring strings only once and referencing them during loading. Each string created during loading is a new string and not shared data among other objects.
* see Implementation Details for the special exception of this case
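The string layout, including the exception that size and data are omitted for an ID that was already written, might look like this in Ruby. It is a sketch under assumptions: dump_string and load_string are hypothetical names, and the ID maps are passed in explicitly as hashes.

```ruby
require "stringio"

TYPE_STRING = 0x30

# Writer: strings are mapped by equality and IDs start at 1.
# A string that was already written is stored as type identifier + ID only.
def dump_string(str, ids, out)
  out << [TYPE_STRING].pack("C")
  if ids.key?(str)
    out << [ids[str]].pack("V")
  else
    ids[str] = ids.size + 1
    bytes = str.encode("UTF-8").b
    out << [ids[str], bytes.bytesize].pack("VV") << bytes
  end
  out
end

# Reader: size and data follow only the first time an ID is read.
# Every call returns a new String object, so loaded strings are never shared.
def load_string(io, strings)
  type = io.read(1).unpack1("C")
  raise ArgumentError, "expected a string" unless type == TYPE_STRING
  id = io.read(4).unpack1("V")
  unless strings.key?(id)
    size = io.read(4).unpack1("V")
    strings[id] = io.read(size).force_encoding("UTF-8")
  end
  strings[id].dup
end

out = String.new
ids = {}
dump_string("hello", ids, out)   # 1 + 4 + 4 + 5 bytes
dump_string("hello", ids, out)   # 1 + 4 bytes, reference only
puts out.bytesize
```

Storing the second occurrence as a reference saves the size field and the string bytes, which is the file size reduction mentioned above.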
array
- 4 bytes: array ID (32-bit int)
- 4 bytes*: size of the array in elements (32-bit int)
- X elements*: serialized objects
Each element in the array is a serialized "object" with a type identifier and the actual data.
* see Implementation Details for the special exception of this case
hash
- 4 bytes: hash ID (32-bit int)
- 4 bytes*: size of the hash in key-value pairs (32-bit int)
- X * serialized objects*: key-value pairs
Each key-value pair consists of two serialized "objects" of which each has a type identifier and the actual data.
* see Implementation Details for the special exception of this case
object
- 9+X bytes: string stored as serialized object (cannot be empty string, hence minimum size of 9 bytes which includes ID, size and minimally 1 character)
- 4 bytes: object ID (32-bit int)
The string that is stored (like any other string object) represents the class name path with a "::" string as separator for module/class names (e.g. RPG::Event::MoveRoute).
If the object class has _arc_dump and/or _arc_load defined (depending on whether ARC::Data#dump or ARC::Data#load has been called):
- 4 bytes*: size of the data package (32-bit int)
- X bytes*: raw data (usually created by _arc_dump and loaded by _arc_load)
If the built-in serialization method is used:
- 4 bytes*: number of instance variables (32-bit int)
- X * serialized objects*: string-object pairs
Each string-object pair consists of two "objects" of which each has a type identifier and the actual data. The first one is a mandatory string containing the name of the instance variable (without Ruby's "@" prefix) and the second can be any data type that is supported for serialization. The string-object pairs are sorted alphabetically by the strings (the instance variables' names).
* see Implementation Details for the special exception of this case
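For the built-in serialization method, collecting the class name path and the sorted string-object pairs could be sketched like this. All names here are hypothetical (including the Demo::Point class), and creating the instance with allocate, without calling initialize, is an assumption about how a loader might rebuild objects rather than a statement about ARC's implementation.

```ruby
# Hypothetical example class inside a module, to get a "::" class path.
module Demo
  class Point
    attr_reader :x, :y
    def initialize
      @y = 2
      @x = 1
    end
  end
end

# Returns [class_path, pairs]: instance variable names without the "@"
# prefix, sorted alphabetically, each paired with its value.
def object_fields(obj)
  pairs = obj.instance_variables.map do |ivar|
    [ivar.to_s.delete_prefix("@"), obj.instance_variable_get(ivar)]
  end
  [obj.class.name, pairs.sort_by(&:first)]
end

# Resolves a path such as "Demo::Point" back to the class.
def resolve_class(path)
  path.split("::").inject(Object) { |mod, name| mod.const_get(name) }
end

# Assumption: the instance is created without running initialize,
# then the instance variables are restored one by one.
def build_object(path, pairs)
  obj = resolve_class(path).allocate
  pairs.each { |name, value| obj.instance_variable_set("@#{name}", value) }
  obj
end

path, pairs = object_fields(Demo::Point.new)
copy = build_object(path, pairs)
puts "#{path}: x=#{copy.x}, y=#{copy.y}"
```

In the real format each name and value would be written as a serialized object with its own type identifier; the sketch only shows the ordering and naming rules.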
Implementation Details
The ID maps themselves are simple arrays where the ID is the index of the object. Of course, the internal mapping can also be created using a key-value container instead of a simple array. IDs start at 1 and are incremented by 1 for each new mapped object. This ensures compatibility between implementations that internally use arrays and key-value containers. The ID 0 should never appear and indicates a problem.
During reading, maps of IDs are created while the serialized objects are being parsed. Each time an ID is read the first time, the object data follows. If the ID was already read previously, it's a reference to an already existing object.
During writing, objects are mapped as well. When an object is being dumped that was already mapped previously, it is enough to simply dump the ID, omit all data (since it has already been dumped under the same ID) and then move on to the next object. The reader should mirror this (as explained above): it maps an object when its ID is read for the first time and only looks the object up on subsequent readings of the same ID.
The ARC Data format handles reference structures of simple trees, trees with loops and trees with recursion while maintaining object identity because of this object mapping method. This means that the ARC Data format fulfills all requirements for full serialization functionality.
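The mapping scheme can be illustrated with a reduced writer and reader that support only the none-type, 32-bit ints and arrays. This is a sketch, not ARC's implementation: dump_value and load_value are hypothetical names, a single ID map is used because only one mapped type appears, and registering the array before parsing its elements is the assumption that makes loops resolvable.

```ruby
require "stringio"

T_NONE = 0x10
T_INT = 0x21
T_ARRAY = 0x40

# Writer: arrays are mapped by identity. An array that was already
# written is stored as type identifier + ID, without size or elements.
def dump_value(value, out, ids = {}.compare_by_identity)
  case value
  when nil
    out << [T_NONE].pack("C")
  when Integer
    out << [T_INT, value].pack("CV")
  when Array
    if ids.key?(value)
      out << [T_ARRAY, ids[value]].pack("CV")
    else
      ids[value] = ids.size + 1 # IDs start at 1
      out << [T_ARRAY, ids[value], value.size].pack("CVV")
      value.each { |element| dump_value(element, out, ids) }
    end
  else
    raise ArgumentError, "unsupported in this sketch: #{value.class}"
  end
  out
end

# Reader: an ID read for the first time is followed by data. The new
# array is registered before its elements are parsed, so a reference
# back to it from inside resolves to the same object.
def load_value(io, arrays = {})
  case io.read(1).unpack1("C")
  when T_NONE then nil
  when T_INT then io.read(4).unpack1("l<")
  when T_ARRAY
    id = io.read(4).unpack1("V")
    return arrays[id] if arrays.key?(id)
    array = arrays[id] = []
    io.read(4).unpack1("V").times { array << load_value(io, arrays) }
    array
  else
    raise ArgumentError, "unknown type identifier"
  end
end

shared = [1, 2]
root = [shared, shared]
root << root # the root array contains itself
loaded = load_value(StringIO.new(dump_value(root, String.new)))
puts loaded[0].equal?(loaded[1]) # shared element is still one object
puts loaded[2].equal?(loaded)    # the loop points back at the root
```

Both checks hold because the writer and the reader assign IDs in the same order, so each reference lands on the object that was mapped first.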
Additional Notes
- Ruby's Bignum (and hence big int) is not properly supported and is converted to a 32-bit int. This limitation is there because many languages do not support a variable byte-count int like Ruby's Bignum. Support of 64-bit int instead or full support of a variable byte-count int may be included in the future.
- Double precision Float is not properly supported. This limitation is there because of the rare need for double precision floating point decimal numbers. Support may be included in the future.