ARC Data Format Specification

Started by Blizzard, May 27, 2012, 07:49:34 am


Blizzard

May 27, 2012, 07:49:34 am Last Edit: July 08, 2012, 04:33:14 am by Blizzard
ARC Data Format Specification
Format Version: 1.0





Overview

The ARC Data format in ARC is used for serialization of all data files saved in the Data directory (except for scripts). The file extension for ARC Data files is usually ".arc". This format is used because it is simpler than Ruby's Marshal format and uses basically no Ruby-specific data types, which makes it easier to port to a different programming or scripting language.




How to use it

In most scenarios you are not required to use the format directly, and in almost no scenario do you need to know its implementation details. In ARC, the ARC::Data module provides the module functions ARC::Data.load(filename) and ARC::Data.dump(filename, object). Objects can be serialized one by one or in an array/hash. It is recommended that you use arrays/hashes for data, as extracting the data after loading is just as simple.




Problem regarding shared data and serialization

Each time data is loaded or dumped, a header definition and the object ID mappings are created anew. Keep in mind that this can create problems if you dump objects separately. Ideally you should dump everything at once. The ARC Data format has this problem as well, but ARC does not allow serialization into an IO stream, so ARC itself is not affected.

Here is an example of the problem if you could serialize several ARC objects into any kind of IO stream:

Let's assume you have class A and class B with the following declarations:

class B
  
  attr_accessor :text
  
  def initialize
    @text = "zero"
  end
  
end

class A
  
  attr_accessor :var # the examples below both read and assign A#var
  
  def initialize
    @var = B.new
  end

end


As you can see, an instance of class A contains an instance of class B.
Now let's consider the following code:

a1 = A.new
a1.var.text = "one"
a2 = A.new
a2.var.text = "two"


a1 and a2 will have different instances of class B. Now let's assume we use this code to dump some data.

print a1.var.text # prints "one"
print a2.var.text # prints "two"

Code: generic dump
ARC::Data.dump(file, a1) # "file" has been opened somewhere previously for writing
ARC::Data.dump(file, a2)


And now we use this code to load the data:

Code: generic load
a3 = ARC::Data.load(file) # "file" has been opened somewhere previously for reading
a4 = ARC::Data.load(file)

print a3.var.text # prints "one"
print a4.var.text # prints "two"
a3.var.text = "three"
a4.var.text = "four"
print a3.var.text # prints "three"
print a4.var.text # prints "four"


These examples work fine as both A#var instances of class B are different instances.
Now let's take a look at what happens if we have objects that reference shared data.

b = B.new
a1.var = b
a2.var = b
b.text = "awesome"
print a1.var.text # prints "awesome"
print a2.var.text # prints "awesome" as well because it's the same object


After dumping and loading (refer to the code blocks labeled "generic dump" and "generic load"), we execute this code:

print a3.var.text # prints "awesome"
print a4.var.text # prints "awesome"
a3.var.text = "not so awesome"
print a3.var.text # prints "not so awesome"
print a4.var.text # prints "awesome"


As you can see, the shared object of class B is not shared anymore and instead the A#var instances have been loaded as separate objects. This is because serialization of separate objects is treated as separate data. If this method was used for dumping:

array = [a1, a2]
ARC::Data.dump(file, array) # "file" has been opened somewhere previously for writing


And this method for loading:

array = ARC::Data.load(file) # "file" has been opened somewhere previously for reading
a3 = array[0]
a4 = array[1]


The shared data would have been preserved as shared data.

print a3.var.text # prints "awesome"
print a4.var.text # prints "awesome"
a3.var.text = "glorious"
print a3.var.text # prints "glorious"
print a4.var.text # prints "glorious"


This problem occurs with Ruby's Marshal format as well and basically with every serialization format that there is.
(RMXP solved the problem of the actors in the party by having a Game_Party#refresh method that updates the array of actors from the Game_Actors class)
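
Since ARC::Data is only available inside ARC, here is a minimal sketch of the same effect using Ruby's built-in Marshal instead. The classes A and B are stand-ins that mirror the declarations above; the variable names are only for illustration.

```ruby
# Stand-ins for the classes A and B from the example above.
class B
  attr_accessor :text
  def initialize
    @text = "zero"
  end
end

class A
  attr_accessor :var
  def initialize
    @var = B.new
  end
end

shared = B.new
shared.text = "awesome"
a1 = A.new
a2 = A.new
a1.var = shared
a2.var = shared

# Two separate dumps: each one starts with a fresh object table,
# so the shared B instance is written (and later loaded) twice.
a3 = Marshal.load(Marshal.dump(a1))
a4 = Marshal.load(Marshal.dump(a2))
separate_shared = a3.var.equal?(a4.var)

# One dump of an array: both references go through the same object table.
b3, b4 = Marshal.load(Marshal.dump([a1, a2]))
together_shared = b3.var.equal?(b4.var)

puts separate_shared # prints false
puts together_shared # prints true
```

Wrapping everything in one array (or hash) before dumping is the design choice that keeps shared references intact.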


Again, the ARC Data format has this problem as well, but our implementation avoids it because our interface only lets you serialize one object at a time. This also means that, within ARC, the format can currently only be written to files. This may change in the future.




File Header

The ARC Data format header is fairly simple as it does not require any complex metadata.


  • 4 bytes: the characters "ARCD"

  • 2 bytes: format version in little endian byte order as unsigned chars to indicate a version (e.g. "1.5" would be the characters 0x1 and 0x5)

  • 1 serialized object: the root object that has been serialized
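
As a rough sketch of the header layout listed above, assuming the first version byte is the major number and the second the minor number (which matches the "1.5" example), the header could be built and checked like this in Ruby. The helper names build_header and parse_header are made up for this example and are not part of ARC.

```ruby
# Hypothetical helpers; only the byte layout is taken from the list above.
MAGIC = "ARCD".b

# Builds the 6-byte header: 4 magic characters followed by 2 version bytes.
def build_header(major, minor)
  MAGIC + [major, minor].pack("CC")
end

# Checks the magic characters and returns [major, minor].
def parse_header(bytes)
  raise ArgumentError, "not an ARC Data file" unless bytes.byteslice(0, 4) == MAGIC
  bytes.byteslice(4, 2).unpack("CC")
end

header = build_header(1, 0)
p header.bytes         # prints [65, 82, 67, 68, 1, 0]
p parse_header(header) # prints [1, 0]
```

The serialized root object would follow directly after these six bytes.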






Object Data

Each serialized "object" has a type identifier and the actual data itself. The serialization begins at the root object and recurses into depth; all objects are mapped by identity (array, hash, object), by equality (string) or simply dumped as raw values.


  • 1 byte: type identifier

  • X bytes: data



The type identifiers are as follows:


  • none-type: 0x10

  • false-type: 0x11

  • true-type: 0x12

  • 32-bit int: 0x21

  • big int: 0x22

  • float: 0x23

  • string: 0x30

  • array: 0x40

  • hash: 0x41

  • object: 0x00



Depending on the type, data can be of different sizes. The substructure of data is determined by the type and contains information about the size of the data itself.

none-type

The data is 0 bytes.

false-type

The data is 0 bytes.

true-type

The data is 0 bytes.

32-bit int


  • 4 bytes: plain 32-bit int stored in little endian byte order (Ruby's pack format 'V')



big int


  • 4 bytes: plain 32-bit int stored in little endian byte order (Ruby's pack format 'V')



float


  • 4 bytes: single-precision floating point stored in little-endian byte order (Ruby's pack format 'e')
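
The value types above could be sketched in Ruby roughly as follows. The type identifiers and the pack formats 'V' and 'e' come from this specification; the method names are hypothetical, and reading the integer back as signed with "l<" is an assumption, since the spec only names the write format.

```ruby
# Type identifiers copied from the list above.
TYPE_NONE, TYPE_FALSE, TYPE_TRUE = 0x10, 0x11, 0x12
TYPE_INT, TYPE_FLOAT = 0x21, 0x23

# Hypothetical helper: type byte followed by 0 or 4 data bytes.
def dump_primitive(value)
  case value
  when nil     then [TYPE_NONE].pack("C")
  when false   then [TYPE_FALSE].pack("C")
  when true    then [TYPE_TRUE].pack("C")
  when Integer then [TYPE_INT].pack("C") + [value & 0xFFFFFFFF].pack("V")
  when Float   then [TYPE_FLOAT].pack("C") + [value].pack("e")
  else raise ArgumentError, "not a primitive: #{value.inspect}"
  end
end

# Hypothetical helper: the inverse of dump_primitive.
def load_primitive(bytes)
  case bytes.unpack1("C")
  when TYPE_NONE  then nil
  when TYPE_FALSE then false
  when TYPE_TRUE  then true
  when TYPE_INT   then bytes.byteslice(1, 4).unpack1("l<") # assumption: signed read
  when TYPE_FLOAT then bytes.byteslice(1, 4).unpack1("e")
  else raise ArgumentError, "unknown type identifier"
  end
end

p dump_primitive(1).bytes   # prints [33, 1, 0, 0, 0]
p dump_primitive(1.5).bytes # prints [35, 0, 0, 192, 63]
```

Masking with 0xFFFFFFFF forces exactly four bytes regardless of the platform's native integer size.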



string


  • 4 bytes: string ID (32-bit int)

  • 4 bytes*: size of the string in bytes (32-bit int)

  • X bytes*: string as UTF-8 stream



While strings do possess an ID, they are all separate objects. The ID is only used to reduce the file size by storing reoccurring strings only once and referencing them during loading. Each string created during loading is a new string and not shared data among other objects.

* see Implementation Details for the special exception of this case
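
A minimal sketch of the string layout, assuming a writer that keeps one map from string content to ID. StringWriter is a hypothetical class for illustration, not ARC's implementation. The first occurrence of a string writes ID, size and bytes; any later occurrence with equal content writes only the ID.

```ruby
TYPE_STRING = 0x30

# Hypothetical writer that maps strings by equality (content), not identity.
class StringWriter
  def initialize
    @ids = {} # string content => ID, IDs start at 1
  end

  def dump(str)
    out = [TYPE_STRING].pack("C")
    if @ids.key?(str)
      out + [@ids[str]].pack("V") # seen before: ID only, data omitted
    else
      id = @ids[str] = @ids.size + 1
      data = str.encode("UTF-8").b
      out + [id, data.bytesize].pack("VV") + data
    end
  end
end

writer = StringWriter.new
p writer.dump("hi").bytes     # prints [48, 1, 0, 0, 0, 2, 0, 0, 0, 104, 105]
p writer.dump("hi".dup).bytes # prints [48, 1, 0, 0, 0]
```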

array


  • 4 bytes: array ID (32-bit int)

  • 4 bytes*: size of the array in elements (32-bit int)

  • X elements*: serialized objects



Each element in the array is a serialized "object" with a type identifier and the actual data.

* see Implementation Details for the special exception of this case

hash


  • 4 bytes: hash ID (32-bit int)

  • 4 bytes*: size of the hash in key-value pairs (32-bit int)

  • X * serialized objects*: key-value pairs



Each key-value pair consists of two serialized "objects" of which each has a type identifier and the actual data.

* see Implementation Details for the special exception of this case

object


  • 9+X bytes: string stored as serialized object (cannot be empty string, hence minimum size of 9 bytes which includes ID, size and minimally 1 character)

  • 4 bytes: object ID (32-bit int)



The string that is stored (like any other string object) represents the class name path with a "::" string as separator for module/class names (e.g. RPG::Event::MoveRoute).
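
In Ruby, such a class name path can be turned back into a class by walking the constants one segment at a time. This is only a sketch: resolve_class and class_path are hypothetical helper names, and the nested RPG classes below are empty stand-ins.

```ruby
# Empty stand-in for a nested class such as RPG::Event::MoveRoute.
module RPG
  class Event
    class MoveRoute; end
  end
end

# Resolves "RPG::Event::MoveRoute" by looking up each segment in the previous scope.
def resolve_class(path)
  path.split("::").inject(Object) { |scope, name| scope.const_get(name) }
end

# The path written during dumping is simply the full class name.
def class_path(object)
  object.class.name
end

path = class_path(RPG::Event::MoveRoute.new)
puts path                                         # prints RPG::Event::MoveRoute
puts resolve_class(path) == RPG::Event::MoveRoute # prints true
```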

If the object class has _arc_dump and/or _arc_load defined (depending on whether ARC::Data.dump or ARC::Data.load has been called):


  • 4 bytes*: size of the data package (32-bit int)

  • X bytes*: raw data (usually created by _arc_dump and loaded by _arc_load)



If the built-in serialization method is used:


  • 4 bytes*: number of instance variables (32-bit int)

  • X * serialized objects*: string-object pairs



Each string-object pair consists of two "objects" of which each has a type identifier and the actual data. The first one is a mandatory string containing the name of the instance variable (without Ruby's "@" prefix) and the second can be any data type that is supported for serialization. The string-object pairs are sorted alphabetically by the strings (the instance variables' names).

* see Implementation Details for the special exception of this case
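
The string-object pairs of the built-in method could be collected roughly like this. The helper names ivar_pairs and restore_ivars and the Point class are hypothetical; only the stripped "@" prefix and the alphabetical order are taken from the text above.

```ruby
# Collects [name, value] pairs without the "@" prefix, sorted by name.
def ivar_pairs(object)
  pairs = object.instance_variables.map do |ivar|
    [ivar.to_s.sub(/\A@/, ""), object.instance_variable_get(ivar)]
  end
  pairs.sort_by { |name, _value| name }
end

# Writes the pairs back into an object, restoring the "@" prefix.
def restore_ivars(object, pairs)
  pairs.each { |name, value| object.instance_variable_set("@#{name}", value) }
  object
end

class Point
  def initialize(x, y)
    @y = y # assigned first on purpose to show that the pairs get sorted
    @x = x
  end
end

pairs = ivar_pairs(Point.new(1, 2))
p pairs # prints [["x", 1], ["y", 2]]

# Point.allocate creates an instance without calling initialize.
copy = restore_ivars(Point.allocate, pairs)
p ivar_pairs(copy) == pairs # prints true
```

Sorting the names gives a stable order for the dumped pairs regardless of the order in which the instance variables were assigned.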




Implementation Details

The ID maps themselves are simple arrays where the ID is the index of the object. Of course, the internal mapping can be created using a key-value container instead of a simple array. IDs start at 1 and are incremented by 1 for each newly mapped object. This ensures compatibility between implementations that internally use arrays and key-value containers. The ID 0 should never appear and indicates a problem.

During reading, maps of IDs are created while the serialized objects are being parsed. The first time an ID is read, the object data follows. If the ID was already read previously, it is a reference to an already existing object.
During writing, objects are mapped as well. When an object is dumped that was already mapped previously, it is enough to dump the ID, omit all data (since it has already been dumped under the same ID) and move on to the next object. The reader should mirror this (as explained above): map an object when its ID is read for the first time, and only look it up on subsequent readings of the same ID.

The ARC Data format handles reference structures of simple trees, trees with loops and trees with recursion while maintaining object identity because of this object mapping method. This means that the ARC Data format fulfills all requirements for full serialization functionality.
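
To illustrate the mapping described above, here is a reduced sketch that supports only 32-bit ints and arrays. MiniWriter and MiniReader are hypothetical names and this is not ARC's actual implementation. The writer maps arrays by identity, and the reader registers an array under its ID before reading the elements, which is what makes self-referencing structures work.

```ruby
require "stringio"

TYPE_INT = 0x21
TYPE_ARRAY = 0x40

class MiniWriter
  def initialize
    @out = "".b
    @ids = {}.compare_by_identity # array object => ID, starting at 1
  end

  def dump(value)
    write(value)
    @out
  end

  private

  def write(value)
    case value
    when Integer
      @out << [TYPE_INT, value & 0xFFFFFFFF].pack("CV")
    when Array
      @out << [TYPE_ARRAY].pack("C")
      if @ids.key?(value)
        @out << [@ids[value]].pack("V") # already mapped: ID only
      else
        @ids[value] = @ids.size + 1
        @out << [@ids[value], value.size].pack("VV")
        value.each { |element| write(element) }
      end
    else
      raise ArgumentError, "unsupported type: #{value.class}"
    end
  end
end

class MiniReader
  def initialize(bytes)
    @io = StringIO.new(bytes)
    @objects = [nil] # index == ID; slot 0 stays unused
  end

  def load
    case @io.read(1).unpack1("C")
    when TYPE_INT
      @io.read(4).unpack1("l<")
    when TYPE_ARRAY
      id = @io.read(4).unpack1("V")
      return @objects[id] if id < @objects.size # known ID: reference
      array = @objects[id] = []                 # map before reading elements
      @io.read(4).unpack1("V").times { array << load }
      array
    else
      raise ArgumentError, "unknown type identifier"
    end
  end
end

shared = [7]
cyclic = [1]
cyclic << cyclic # an array that contains itself
loaded = MiniReader.new(MiniWriter.new.dump([shared, shared, cyclic])).load
puts loaded[0].equal?(loaded[1])    # prints true
puts loaded[2][1].equal?(loaded[2]) # prints true
```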




Additional Notes


  • Ruby's Bignum (and hence big int) is not properly supported and is converted to a 32-bit int. This limitation is there because many languages do not support a variable byte-count int like Ruby's Bignum. Support of 64-bit int instead or full support of a variable byte-count int may be included in the future.

  • Double precision Float is not properly supported. This limitation is there because of the rare need for double precision floating point decimal numbers. Support may be included in the future.
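
The effect of the single-precision limitation can be seen with a quick round trip through pack format 'e', the format named in the float section. The helper roundtrip_single is a made-up name for this example.

```ruby
# Packs a Float as 4-byte single precision and reads it back.
def roundtrip_single(value)
  [value].pack("e").unpack1("e")
end

exact   = roundtrip_single(0.5) # 0.5 is exactly representable in single precision
rounded = roundtrip_single(0.1) # 0.1 is not, so the loaded value differs slightly

puts exact == 0.5                   # prints true
puts rounded == 0.1                 # prints false
puts((rounded - 0.1).abs < 1e-7)    # prints true
```

Code that compares loaded floats against literals should therefore use a tolerance rather than exact equality.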


Check out Daygames and our games:

King of Booze 2      King of Booze: Never Ever
Drinking Game for Android      Never have I ever for Android
Drinking Game for iOS      Never have I ever for iOS


Quote from: winkioI do not speak to bricks, either as individuals or in wall form.

Quote from: Barney StinsonWhen I get sad, I stop being sad and be awesome instead. True story.

ForeverZer0

I have a somewhat related question about this. Wouldn't something serialized on an x64 machine produce incorrect results, since the integer size is 8 bytes instead of 4 bytes?

I played around and output a few files using 64-bit and was getting errors for being the wrong version. It was saying the version was 65.82 when I wrote the version as "\x01\x00" (after writing the "ARCD" header). I never even got so far as the data actually getting deserialized. I also noted that the output files were exactly twice as big, to the exact byte if I remember correctly.

I wasn't sure if there was a workaround to this, or if it required an x86.
I am done scripting for RMXP. I will likely not offer support for even my own scripts anymore, but feel free to ask on the forum, there are plenty of other talented scripters that can help you.

Blizzard

May 30, 2012, 01:51:15 am #2 Last Edit: May 30, 2012, 01:52:18 am by Blizzard
I have an x64 machine and I never had problems with that.
ARC's Game.exe is compiled as a 32-bit application and it will work as such even if it runs on a 64-bit machine. In which language exactly did you use this? The thing is that I made sure even in the Ruby and the Python implementations to force a 32-bit integer during writing.

If you made something on your own, make sure to compile it as a 32-bit application or make sure that you are dumping it forcibly as a 32-bit int.

ForeverZer0

After doing some more exploring, a lot of cursing, and thorough testing, I have come to really like this format.
To answer your question, I was attempting to use it with IronRuby, which was proving very problematic. For some reason or another, using the exact same scripts as I did with C Ruby, it would screw up the format.

After a lot of screwing around and even having to change the scripts around a bit (IronRuby apparently is missing a few of the destructive methods of Enumerable...), I got it to finally work, only to discover that it was painfully slow. Like really slow, as in "Animations" would take nearly 10 minutes to dump. After that letdown, I scrapped messing with IronRuby altogether and tried implementing a C# version, which I am proud to say is successfully loading and dumping .arc files.

I admit that I didn't pay much attention to you other two when you were developing the serialization method, but I must say, you two did a very nice job developing it. It's elegantly simple, fast, and efficient. Not only that, but we now have four different languages that are capable of utilizing it: C++, Python, C#, and of course Ruby.

I think that's pretty neat. :)

Blizzard

June 03, 2012, 02:53:48 am #4 Last Edit: June 03, 2012, 02:59:57 am by Blizzard
Yeah, we have noticed a huge slowdown with Animations as well. Though it didn't take 10 minutes. It took maybe 10 seconds with a fresh Animations.rxdata from a new project. I guess your Ruby implementation is slow because Ruby is slow. Our C++ implementation works much faster. Though, the implementation was created using high level coding so it's probably slower than a pure C implementation with most basic functioning would be.

ForeverZer0

I was surprised about the slowness. For nearly everything else, IronRuby is as fast as or even faster than C Ruby. I was testing in interactive Ruby, and while there was a little slowdown with animations, it was nothing unexpected or to worry about.  I then used the exact same script and method with IronRuby, and it was multiple times slower. When it got to Animations, I thought it was frozen, and I didn't see how slow it really was until I just left it running and went to do other things while it dumped, only to come back 10 minutes later to see it was not yet done.  :???:

I still don't know what the hell was causing the slowdown. It can use Marshal very quickly, which likely involves similar processing. IronRuby is far from perfect or even a 100% version of pure Ruby, but that was unacceptable. My opinion of IronRuby is quickly falling. I seem to run into new issues with it every time I try to do something new with it, and it's becoming more of a headache to use than anything else. I can only imagine the problems that would have arisen if the original concept of using IronRuby embedded in an engine had been attempted...

Blizzard

June 03, 2012, 03:17:16 am #6 Last Edit: June 03, 2012, 03:37:20 am by Blizzard
The final structure is a bit different, but yes, this format is actually very similar to Marshal. I was surprised at the slowness in Ruby myself. These are some arbitrary benchmarks:


  • Marshal read: 0.05 s

  • Marshal write: 0.8 s

  • ARCD read: 0.8 s

  • ARCD write: 10 s



So our Ruby implementation of ARCD is roughly 12 to 16 times slower than the C implementation of Marshal in these benchmarks. I didn't benchmark our C++ implementation, but while I was testing it recently for bugs, etc., the writing was done in 2-3 seconds I think.

EDIT: I updated the section "Problem regarding shared data and serialization" a bit. I realized that the serialization problem affects this format as well. It's just that we avoid it in ARC.

EDIT: I remember that reading and writing was significantly sped up when we didn't use mapping of objects (which obviously causes problems as soon as there are multiple objects referencing one specific object). Using hashes turned out to be slower than arrays as well (if I remember right). I wonder if there is a way to speed mapping up.

EDIT: Yeah, Ruby hashes were slower than arrays, but not much. Maybe a slowdown of 10%. We interfaced the mapping so switching everything from arrays to hashes takes like a minute or two. :>

ForeverZer0

The C# implementation I made is fast, though I have not done any benchmarking with it. I deviated a bit from the Python and C++ versions, which use strings, and use only arrays of bytes instead. This eliminates the need for pack and unpack methods, but I don't really know how much effect that has, since strings are essentially just arrays of bytes anyway. One minor bottleneck is that I had to use a regular expression for the class name. Reflection makes it easy to grab all the possible types defined in the assemblies, but it handles the names of nested classes in a strange way (when using .ToString() on a Type). I can't do a simple find-replace of "::" to ".", since nested classes such as RMXP's "RPG::Event::Page::Condition" use a "+" sign in this case. The string name of the type ends up being "RPG.Event+Page+Condition", so I have to replace all the "::" with "[\.|\+]" and then try to make a match to find the type.

Another stupid thing is the fact that RMXP uses "int" as a parameter name, which of course is a reserved word in C# and many other languages. I have to check every single property and, if it is "int", change it to "intel" when loading objects. While this may not be too bad on performance, it really irks me.

Blizzard

It seems to me that the best way to make it work is maybe a meta class.

That replacing thing is weird. Are you sure that you passed the proper parameters to the replace method? Worst case, you could use split/join.

ForeverZer0

I have to find a type by name. C# separates the name of a class nested within a class with a plus sign. That's how it's supposed to be, whereas namespaces are separated with periods. If you replace the plus sign with a period and try to find the type by name using Reflection, it will not work.

Here's what I mean:

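// Needs: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
// _mappedTypes and _assemblyTypes are fields defined elsewhere (presumably Dictionary<string, Type>).
private static Type GetMappedType(string classPath)
{
    // Return the cached type if this class path was resolved before.
    if (_mappedTypes.ContainsKey(classPath))
        return _mappedTypes[classPath];

    // Each "::" may correspond to "." (namespace) or "+" (nested class) in a .NET type name.
    Regex regex = new Regex(String.Join(@"[\+|\.]", Regex.Split(classPath, "::")));
    foreach (string typeName in _assemblyTypes.Keys)
    {
        if (regex.Match(typeName).Success)
        {
            // Cache the match so the regex search runs only once per class path.
            _mappedTypes[classPath] = _assemblyTypes[typeName];
            return _mappedTypes[classPath];
        }
    }
    throw new TypeLoadException(String.Format("Type of \"{0}\" cannot be found in loaded assemblies.", classPath));
}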

Blizzard

June 03, 2012, 05:13:13 pm #10 Last Edit: June 03, 2012, 05:23:12 pm by Blizzard
Fuck. ._. I didn't know about the "+" thing.

But if you remember Ryex's Python implementation, he had to implement a different handling for classes there as well. Luckily it was mostly just substituting the "::" for ".". That reminds me that I forgot to add the object class name in the specs. >.<