I've been studying Ruby's Marshal binary format. I've figured out a few things, but I'm still stuck on a couple of others, so maybe some of you can help. So far I've only serialized a few data types and identified a few key bytes.
Here's some code.
class ZObject
  attr_accessor :data

  def initialize(obj = nil)
    @data = obj
  end
end

obj = [ZObject.new(true), ZObject.new(false)]
file = File.open("format.bin", "wb")
Marshal.dump([5, 6, 7, 8], file)
Marshal.dump(obj, file)
file.close
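For anyone who wants to reproduce the byte listings in this post without a hex editor, here is a small sketch of a dump helper. The name hexdump and its layout (hex bytes, a dash, then printable characters with dots for everything else) are my own choices, not anything Marshal provides.

```ruby
# hexdump is a hypothetical helper: it formats an array of byte values
# as rows of hex followed by the printable characters.
def hexdump(bytes, width = 16)
  bytes.each_slice(width).map do |row|
    hex  = row.map { |b| format("%02X", b) }.join(" ")
    text = row.map { |b| (0x20..0x7E).cover?(b) ? b.chr : "." }.join
    "#{hex} - #{text}"
  end
end

puts hexdump(Marshal.dump([5, 6, 7, 8]).bytes)
```

To dump the file instead, pass File.binread("format.bin").bytes.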
Here's the output in bytes and the respective characters
04 08 5B 09 69 0A 69 0B 69 0C 69 0D 04 08 5B 07 - ..[.i.i.i.i...[.
6F 3A 0C 5A 4F 62 6A 65 63 74 06 3A 0A 40 64 61 - o:.ZObject.:.@da
74 61 54 6F 3B 00 06 3B 06 46 - taTo;..;.F
I read somewhere that the first two bytes are Marshal's major and minor version, so in this case it would be 4.8.
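The version reading can be checked against the constants Ruby exposes. This is only a quick sanity check: it compares the first two bytes of a dump with Marshal::MAJOR_VERSION and Marshal::MINOR_VERSION.

```ruby
# Compare the two header bytes of a dump with Ruby's own version constants.
def marshal_header(value)
  Marshal.dump(value).bytes.first(2)
end

major, minor = marshal_header([5, 6, 7, 8])
puts "header bytes: #{major}.#{minor}"
puts "constants:    #{Marshal::MAJOR_VERSION}.#{Marshal::MINOR_VERSION}"
```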
Arrays are identified by the byte 0x5B ('['), and the byte after it is the number of entries in the array. What's odd is that small integers are stored as 5 higher than their actual value: 0x09 is the size of the array, and subtracting 5 gives 4. Integers are identified by the byte 0x69 ('i', giggity), and the byte(s) that follow are the actual integer, again 5 higher than the real value. I'm not sure why yet, and I've only tried small values so far, nothing at 255 or higher. There are no separator bytes between values. One more thing I found out: every time Marshal.dump is called it writes the version number again, which is why the bytes 04 08 show up a second time right after the first array.
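To see where the "plus 5" rule stops holding, one option is to dump a handful of integers around the edges and look at the bytes after the header. This is a sketch: payload_hex and load_all are names I made up, and the test values are just ones I guessed would be interesting.

```ruby
require "stringio"

# payload_hex is a hypothetical helper: the bytes of a dump after the
# two-byte version header, as hex.
def payload_hex(value)
  Marshal.dump(value).bytes.drop(2).map { |b| format("%02X", b) }.join(" ")
end

[0, 1, 4, 122, 123, 255, 256, -1, -123, -124].each do |n|
  puts format("%5d -> %s", n, payload_hex(n))
end

# Each dump carries its own header, so several dumps written back to back
# can be read back one at a time from the same stream.
def load_all(io)
  values = []
  values << Marshal.load(io) until io.eof?
  values
end

p load_all(StringIO.new(Marshal.dump([5, 6, 7, 8]) + Marshal.dump(:next)))
```

On the Ruby versions I've tried, 0 comes out as a plain 00, values from 1 to 122 are stored as value + 5, and from 123 up the first byte turns into a byte count followed by the value in little-endian order (123 is 01 7B, 256 is 02 00 01). Small negatives go the other way, with 5 subtracted.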
True is identified by a capital T (0x54) and false by a capital F (0x46).
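The same kind of check works for the single-byte values. A minimal sketch (tag_of is a made-up helper; nil is included only for comparison and isn't something I covered above):

```ruby
# tag_of is a hypothetical helper: the first payload byte of a dump,
# as a character.
def tag_of(value)
  Marshal.dump(value).bytes[2].chr
end

[true, false, nil].each do |v|
  puts "#{v.inspect} -> #{tag_of(v).inspect}"
end
```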
Now let's look at the bytes 6F, 3A, 0C at the beginning of the 2nd line. I'm not entirely sure, but I've narrowed 6F ('o') down to telling Ruby that a serialized object (an instance of a class) follows. The byte after it, 3A (':'), I'm less sure about. I think it marks a symbol, class name, or variable name, because right after it comes the length of the name. 0C is the length of the class name: 0C is 12, minus 5 is 7, which is the number of characters in ZObject. The byte right after ZObject, 06, I'm not quite sure about. When a variable is serialized, the data belonging to it is serialized immediately after its name. If it's an integer, it's written as an i followed by the number.
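One way to test a guess about the byte after the class name is to vary one thing at a time and watch that byte. The sketch below varies the number of instance variables; Probe and byte_after_class_name are hypothetical names, and the index arithmetic assumes the layout described above (header, 6F, 3A, length, name).

```ruby
# Probe is a throwaway class used only for this experiment.
class Probe; end

# Returns the byte that follows the class name in the dump of a Probe
# that has n instance variables.
def byte_after_class_name(n)
  obj = Probe.new
  n.times { |i| obj.instance_variable_set("@v#{i}", true) }
  bytes = Marshal.dump(obj).bytes
  # Layout: 04 08 | 6F | 3A | length | "Probe" | <byte under test>
  bytes[5 + "Probe".length]
end

(0..3).each do |n|
  puts format("%d instance variable(s) -> %02X", n, byte_after_class_name(n))
end
```

If that byte tracks the number of instance variables (with the same plus 5 offset as other small integers), then the 06 after ZObject would simply mean "one instance variable follows".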
Now for something else I noticed: if an object has already been serialized in the same Marshal.dump call, the next one is written in a more compact form.
6F 3B 00 06 3B 06 46 - o;..;.F
These last bytes in the file are the 2nd ZObject, so it's clearly getting compressed. What I'm trying to understand is how this list of bytes is equivalent to the first serialized ZObject, other than @data being set to false rather than true. Is it generating a checksum?
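A smaller experiment that isolates the 3B byte: dump an array that repeats a couple of symbols and see what the repeats turn into. This is a sketch; symbol_dump_hex is a name I made up.

```ruby
# symbol_dump_hex is a hypothetical helper: payload bytes (header removed)
# of a dump, as hex.
def symbol_dump_hex(value)
  Marshal.dump(value).bytes.drop(2).map { |b| format("%02X", b) }.join(" ")
end

puts symbol_dump_hex([:a, :b, :a, :b])
```

The first :a and :b are written in full (3A, length, characters), while the repeats come out as 3B 00 and 3B 06. That looks less like a checksum and more like a back-reference: 3B followed by the position of a name already seen in this dump, counted from 0 and encoded like any other small integer. Read that way, 6F 3B 00 06 3B 06 46 would be: object, name #0 (ZObject), one instance variable, name #1 (@data), false.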
So I added a new class with the same variable name and discovered something else. To save space when serializing, Marshal seems to emit a short byte sequence that stands in for the name of a class, symbol, or variable it has already written.
New Code (Only serializes two objects, one ZObject and one SObject. Each has a variable named @data)
class ZObject
  def initialize(obj = nil)
    @data = obj
  end
end

class SObject
  def initialize(obj = nil)
    @data = obj
  end
end

begin
  obj = [ZObject.new(true), SObject.new(false)]
  file = File.open("format.bin", "wb")
  Marshal.dump(obj, file)
  file.close
  exit
rescue
  puts $!
  gets
  exit
end
It generates this. (Pay attention to how @data is written in each of the two objects.)
04 08 5B 07 6F 3A 0C 5A 4F 62 6A 65 63 74 06 3A - ..[.o:.ZObject.:
0A 40 64 61 74 61 54 6F 3A 0C 53 4F 62 6A 65 63 - .@dataTo:.SObjec
74 06 3B 06 46 - t.;.F
As you can see, @data is written out in full in the ZObject (3A 0A 40 64 61 74 61), but when it's serialized again in the SObject it doesn't come out the same: there it's just 3B 06. You can tell it's getting more compressed. I'm still unsure what the byte 06 means (the one that comes right after the last character of a class name).
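To check whether the pieces identified so far fit together, here is a rough sketch of a reader that decodes the dump above by hand. MiniReader is a hypothetical class, not part of Ruby. It assumes that 3B is an index into the names already seen and that the integer after the class name is the instance-variable count, and it only handles the tags that appear in this post (plus nil).

```ruby
require "stringio"

# MiniReader is a hypothetical, deliberately incomplete decoder.
class MiniReader
  def initialize(bytes)
    @io = StringIO.new(bytes)
    @symbols = [] # names seen so far, in order of first appearance
  end

  def read
    major, minor = @io.read(2).bytes
    raise "unexpected header #{major}.#{minor}" unless [major, minor] == [4, 8]
    read_value
  end

  private

  # Small integers are offset by 5; otherwise the first byte is a
  # (signed) count of little-endian bytes that follow.
  def read_int
    c = @io.getbyte
    c -= 256 if c > 127
    case c
    when 0 then 0
    when 1..4 then (0...c).sum { |i| @io.getbyte << (8 * i) }
    when -4..-1
      (0...-c).reduce(-1) { |x, i| (x & ~(0xFF << (8 * i))) | (@io.getbyte << (8 * i)) }
    when 5..127 then c - 5
    else c + 5
    end
  end

  def read_value
    tag = @io.getbyte.chr
    case tag
    when "0" then nil
    when "T" then true
    when "F" then false
    when "i" then read_int
    when "[" then Array.new(read_int) { read_value }
    when ":" then read_symbol
    when ";" then @symbols.fetch(read_int) # assumed: index into seen names
    when "o"
      klass = read_value
      ivars = Array.new(read_int) { [read_value, read_value] }.to_h
      { class: klass, ivars: ivars }
    else raise "unhandled tag #{tag.inspect}"
    end
  end

  def read_symbol
    sym = @io.read(read_int).to_sym
    @symbols << sym
    sym
  end
end

SAMPLE = "04 08 5B 07 6F 3A 0C 5A 4F 62 6A 65 63 74 06 3A " \
         "0A 40 64 61 74 61 54 6F 3A 0C 53 4F 62 6A 65 63 " \
         "74 06 3B 06 46"
SAMPLE_BYTES = SAMPLE.split.map { |h| h.to_i(16) }.pack("C*")

p MiniReader.new(SAMPLE_BYTES).read
```

It returns plain hashes instead of rebuilding ZObject and SObject, so it runs without those classes defined. If the assumptions hold, the second object's 3B 06 resolves to the same @data name the first object wrote in full.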
I'm wondering if anyone can help me figure out, or already knows, how Marshal decides how to compress these names.