
Floating Point in JSON Without Loss of Precision

Anonym

I was recently having a debate with a colleague about the pros and cons of JSON as a data storage and transport format.  Its simplicity and efficiency, at least compared to XML, have helped make it one of the dominant technologies for client/server communication, and even for data/metadata storage.  But there is one potential downside: it is a text format, not a binary format, so floating point numbers can lose precision when they are encoded in a JSON stream.  The conversion of number to string or string to number can result in truncation or round-off errors, especially if different languages are used for the two conversions.
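A minimal Python sketch of this kind of loss, assuming IEEE 754 doubles.  The helper name `roundtrips` is made up for illustration; it writes a number as text with a chosen number of significant digits, parses it back, and compares the raw bytes.

```python
import struct

def roundtrips(x: float, digits: int) -> bool:
    """Format x with the given number of significant digits, parse it
    back, and compare the raw 8-byte representations."""
    text = f"{x:.{digits}g}"
    back = float(text)
    return struct.pack("<d", back) == struct.pack("<d", x)

x = 0.1 + 0.2  # not exactly 0.3 in binary floating point

# With 15 digits the text is "0.3", which parses to a different double.
print(roundtrips(x, 15))

# 17 significant digits are enough to round-trip a finite double.
print(roundtrips(x, 17))
```

Whether a given writer and reader pair actually emits and parses enough digits is outside the control of the data itself, which is why a bit-exact encoding is attractive.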

What we need is to express the floating point numbers in their binary form, or an equivalent form, when encoded in JSON.  We have a few options here, which vary in encoding efficiency and implementation complexity.  I show how to encode and decode each of these options in IDL, starting with the simplest but least efficient.  Note that I do not do anything about Endianness, so if you need to communicate between big- and little-Endian systems then you will need to modify this code to always encode the values in a given Endianness and then handle the conversion for the other system.

 

Float to Byte Array

IDL has the ability to convert a Float or Double into an array of Bytes, like reinterpret_cast<> in C++.  The help is a little confusing in its explanation, but you can specify an offset and a set of dimensions to the typecasting functions BYTE(), FIX(), LONG(), …  When you do this, IDL goes to the block of memory storing the Float(s) and just treats it as an array of Bytes instead.  Note that the Offset argument is in the output datatype index, not input datatype, as are the dimension arguments:

IDL> byte(!pi)
   3
IDL> byte(!pi, 0)
 219
IDL> byte(!pi, 1)
  15
IDL> byte(!pi, 2)
  73
IDL> byte(!pi, 3)
  64
IDL> byte(!pi, 0, 4)
 219  15  73  64

We can see how calling Byte() with only one argument performs a typecast coercion, making Pi the Byte value 3.  But once we start using the Offset argument we grab each Byte from the block of 4 used to store the Float.  Using two additional arguments allows us to get the array of 4 Bytes.  Converting back to Float is similarly accomplished:

IDL> float([219b, 15b, 73b, 64b], 0, 1)
       3.1415927
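
For comparison, the same reinterpretation can be sketched in Python with the standard `struct` module.  This assumes the IDL session above ran on a little-endian machine, which the byte order 219, 15, 73, 64 suggests.  The explicit `<` and `>` prefixes are one way to pin the byte order, something the IDL code in this article leaves to the host.

```python
import math
import struct

# Pack pi as a 4-byte IEEE 754 single, little-endian ("<f").
le_bytes = list(struct.pack("<f", math.pi))
print(le_bytes)  # [219, 15, 73, 64]

# The same value with big-endian byte order (">f") is simply reversed.
be_bytes = list(struct.pack(">f", math.pi))
print(be_bytes)  # [64, 73, 15, 219]

# Reinterpret the four bytes as a single-precision float again.
(value,) = struct.unpack("<f", bytes([219, 15, 73, 64]))
print(value)  # pi rounded to single precision
```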

Using these functions we can build recursive functions to encode and decode Lists and Hashes that represent parsed JSON documents.  The encode function iterates through List and Hash objects, recursing on List and Hash elements, copying non-floating point elements, and converting Float and Double scalars and arrays.  While we could just replace the floating point numbers with arrays of Bytes, we don’t want to confuse these encoded forms with normal Byte arrays.  So we create a new Hash object that stores the Byte array under a “bytes” key and the type of encoding under an “encoding” key.  The value of the encoding key is the name of the function we use to decode the Byte array back to Float or Double.  This behavior could be extended to complex numbers, which aren’t natively supported by JSON at all.

The decode function is similar to the encode function, iterating and recursing on List and Hash elements.  The main difference is that when it finds a Hash with both the “encoding” and “bytes” keys, it invokes the appropriate function to convert the Byte array back into Float or Double values.

The encoding of the floating point values as Byte arrays always uses the number of Bytes needed as the first dimension, so a scalar Float becomes a 4-element Byte array and an [M, N] array of Floats becomes a [4, M, N] array of Bytes.  If a scalar Float or Double is passed in, Size(/DIMENSION) will return 0, so we use the ternary operator to set the input dimensions to a scalar 1.  This means that the dimensions of the Byte array are [4, 1] or [8, 1], which IDL automatically reduces to a 1D array.  On decoding, we have simple validation that the first dimension is the correct number (4 or 8), and then strip it off to get the dimensions of the floating point values.

Here is the full code:

function ByteArrayToFloat, bytes
  compile_opt idl2
 
  dims = Size(bytes, /DIMENSION)
  if (dims[0] ne 4) then Message, 'Invalid byte array, first dimension must be 4'
  floatDims = (Size(bytes, /N_DIMENSION) eq 1) ? 1 : dims[1:*]
  return, Float(bytes, 0, floatDims)
end
 
function ByteArrayToDouble, bytes
  compile_opt idl2
 
  dims = Size(bytes, /DIMENSION)
  if (dims[0] ne 8) then Message, 'Invalid byte array, first dimension must be 8'
  doubleDims = (Size(bytes, /N_DIMENSION) eq 1) ? 1 : dims[1:*]
  return, Double(bytes, 0, doubleDims)
end
 
function decodeJSONFloatFromByteArray, inJson
  compile_opt idl2
 
  if (ISA(inJson, 'Hash')) then begin
    if (inJson.HasKey('encoding') && inJson.HasKey('bytes')) then begin
      if (inJson['encoding'] eq 'ByteArrayToFloat') then begin
        return, ByteArrayToFloat(inJson['bytes'])
      endif else if (inJson['encoding'] eq 'ByteArrayToDouble') then begin
        return, ByteArrayToDouble(inJson['bytes'])
      endif
    endif else begin
      outJson = Hash()
      foreach val, inJson, key do begin
        outJson[key] = decodeJSONFloatFromByteArray(val)
      endforeach
      return, outJson
    endelse
  endif else if (ISA(inJson, 'List')) then begin
    outJson = List()
    foreach el, inJson, index do begin
      outJson.Add, decodeJSONFloatFromByteArray(el)
    endforeach
    return, outJson
  endif
  return, inJson
end
 
function encodeJSONFloatToByteArray, inJson
  compile_opt idl2
 
  if (ISA(inJson, 'Hash')) then begin
    outJson = Hash()
    foreach val, inJson, key do begin
      outJson[key] = encodeJSONFloatToByteArray(val)
    endforeach
    return, outJson
  endif else if (ISA(inJson, 'List')) then begin
    outJson = List()
    foreach el, inJson, index do begin
      outJson.Add, encodeJSONFloatToByteArray(el)
    endforeach
    return, outJson
  endif else begin
    if (ISA(inJson, 'Float')) then begin
      dims = ISA(inJson, /SCALAR) ? 1 : Size(inJson, /DIMENSION)
      byteDims = [ 4, dims ]
      return, Hash('encoding', 'ByteArrayToFloat', 'bytes', Byte(inJson, 0, byteDims))
    endif else if (ISA(inJson, 'Double')) then begin
      dims = ISA(inJson, /SCALAR) ? 1 : Size(inJson, /DIMENSION)
      byteDims = [ 8, dims ]
      return, Hash('encoding', 'ByteArrayToDouble', 'bytes', Byte(inJson, 0, byteDims))
    endif
  endelse
  return, inJson
end

 

The use of the encode and decode functions is easy to see:

IDL> j = List(Hash('factory', 'foo', 'value', Findgen(3,2)), Hash('factory', 'bar', 'val', Dindgen(4,3)))
IDL> j
[
    {
        "factory": "foo",
        "value": [[0.00000000, 1.0000000, 2.0000000], [3.0000000, 4.0000000, 5.0000000]]
    },
    {
        "factory": "bar",
        "val": [[0.00000000000000000, 1.0000000000000000, 2.0000000000000000, 3.0000000000000000], [4.0000000000000000, 5.0000000000000000, 6.0000000000000000, 7.0000000000000000], [8.0000000000000000, 9.0000000000000000, 10.000000000000000, 11.000000000000000]]
    }
]
IDL> j2 = encodeJSONFloatToByteArray(j)
IDL> j2
[
    {
        "factory": "foo",
        "value": {
            "bytes": [[[0, 0, 0, 0], [0, 0, 128, 63], [0, 0, 0, 64]], [[0, 0, 64, 64], [0, 0, 128, 64], [0, 0, 160, 64]]],
            "encoding": "ByteArrayToFloat"
        }
    },
    {
        "factory": "bar",
        "val": {
            "bytes": [[[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 240, 63], [0, 0, 0, 0, 0, 0, 0, 64], [0, 0, 0, 0, 0, 0, 8, 64]], [[0, 0, 0, 0, 0, 0, 16, 64], [0, 0, 0, 0, 0, 0, 20, 64], [0, 0, 0, 0, 0, 0, 24, 64], [0, 0, 0, 0, 0, 0, 28, 64]], [[0, 0, 0, 0, 0, 0, 32, 64], [0, 0, 0, 0, 0, 0, 34, 64], [0, 0, 0, 0, 0, 0, 36, 64], [0, 0, 0, 0, 0, 0, 38, 64]]],
            "encoding": "ByteArrayToDouble"
        }
    }
]
IDL> j3 = decodeJSONFloatFromByteArray(j2)
IDL> j3
[
    {
        "factory": "foo",
        "value": [[0.00000000, 1.0000000, 2.0000000], [3.0000000, 4.0000000, 5.0000000]]
    },
    {
        "factory": "bar",
        "val": [[0.00000000000000000, 1.0000000000000000, 2.0000000000000000, 3.0000000000000000], [4.0000000000000000, 5.0000000000000000, 6.0000000000000000, 7.0000000000000000], [8.0000000000000000, 9.0000000000000000, 10.000000000000000, 11.000000000000000]]
    }
]

 

As I mentioned previously, the easier-to-implement code results in a less efficient encoding.  In this case the string returned by calling JSON_Serialize(j) is 139 characters long, while JSON_Serialize(j2) returns a 455 character string, an expansion of 227%.  The other problem is that this isn’t as easy to implement in other languages that we might be sending the encoded JSON stream to.
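
To give a feel for what a receiver in another language would have to do, here is a rough Python sketch of a decoder for this byte-array scheme.  It is not part of the IDL code above: the names `FORMATS`, `bytes_to_floats` and `decode` are hypothetical, and the `<` in the format strings assumes the bytes were written on a little-endian machine.

```python
import json
import struct

# Assumed mapping from the article's "encoding" names to struct formats.
# "<" assumes the bytes were produced on a little-endian machine.
FORMATS = {"ByteArrayToFloat": "<f", "ByteArrayToDouble": "<d"}

def bytes_to_floats(node, fmt):
    """Turn nested lists whose innermost lists hold raw byte values into
    nested lists of floats, with the innermost level removed."""
    if node and all(isinstance(b, int) for b in node):
        # struct.unpack raises struct.error if the byte count is wrong.
        return struct.unpack(fmt, bytes(node))[0]
    return [bytes_to_floats(child, fmt) for child in node]

def decode(node):
    """Recursively replace {"encoding": ..., "bytes": ...} objects."""
    if isinstance(node, dict):
        if node.get("encoding") in FORMATS and "bytes" in node:
            return bytes_to_floats(node["bytes"], FORMATS[node["encoding"]])
        return {key: decode(val) for key, val in node.items()}
    if isinstance(node, list):
        return [decode(item) for item in node]
    return node

text = '''[{"factory": "foo", "value": {"encoding": "ByteArrayToFloat",
  "bytes": [[[0,0,0,0],[0,0,128,63],[0,0,0,64]],
            [[0,0,64,64],[0,0,128,64],[0,0,160,64]]]}}]'''
decoded = decode(json.loads(text))
print(decoded[0]["value"])  # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
```

The receiver has to know both the nesting convention (bytes innermost) and the byte order, neither of which is recorded in the JSON itself.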

 

Float to Hex String

A more common representation of floating point numbers is the hexadecimal string of the raw binary representation.  Instead of converting each Float or Double to a 4- or 8-element array of Bytes, we convert it to an 8- or 16-character string built from the hexadecimal values of each of those Bytes.  There are a couple of ways to accomplish this, but I took the tack of converting the floating point array to an unsigned integer array of the same dimensions and appropriate bit depth (ULong for Float, ULong64 for Double).  We could convert from Float to Byte array and then to ULong, but there’s no need in this case, since the ULong() function will perform the appropriate reinterpret_cast<> operation.  I then convert the ULong values to strings using the FORMAT keyword in the String() function to specify hexadecimal output.  I could have used the simple FORMAT='(Z)', but I like determinism so I used '(Z08)' and '(Z016)' to always output the same number of characters, zero padding the strings that could be expressed in fewer than 8 (or 16) characters.  The conversion to String does result in a 1-D array, so I use Reform() to convert it to the expected dimensions.  One nice improvement of this encoding over the first case is that the dimensionality of the data doesn’t change.

Decoding the hexadecimal strings is basically the inverse process.  I get the dimensions of the hexadecimal string(s) and allocate a ULong or ULong64 array of the same dimensions.  I then use READS to read from the string(s) into the long(s), with the FORMAT keyword set to tell it to expect the strings to be in hexadecimal format.  I can set the format to '(Z)' and it will parse all the strings into the corresponding longs.  I then pass the long(s) into the Float or Double functions, with offset 0 and the dimensions of the hexadecimal strings.

Here is the full code:

function HexStringToFloat, hexStr
  compile_opt idl2
 
  dims = ISA(hexStr, /SCALAR) ? 1 : Size(hexStr, /DIMENSION)
  longs = ULonArr(dims, /NOZERO)
  Reads, hexStr, longs, FORMAT='(Z)'
  return, Float(longs, 0, dims)
end
 
function HexStringToDouble, hexStr
  compile_opt idl2
 
  dims = ISA(hexStr, /SCALAR) ? 1 : Size(hexStr, /DIMENSION)
  longs = ULon64Arr(dims, /NOZERO)
  Reads, hexStr, longs, FORMAT='(Z)'
  return, Double(longs, 0, dims)
end
 
function decodeJSONFloatFromHexString, inJson
  compile_opt idl2
 
  if (ISA(inJson, 'Hash')) then begin
    if (inJson.HasKey('encoding') && inJson.HasKey('hex_string')) then begin
      if (inJson['encoding'] eq 'HexStringToFloat') then begin
        return, HexStringToFloat(inJson['hex_string'])
      endif else if (inJson['encoding'] eq 'HexStringToDouble') then begin
        return, HexStringToDouble(inJson['hex_string'])
      endif
    endif else begin
      outJson = Hash()
      foreach val, inJson, key do begin
        outJson[key] = decodeJSONFloatFromHexString(val)
      endforeach
      return, outJson
    endelse
  endif else if (ISA(inJson, 'List')) then begin
    outJson = List()
    foreach el, inJson, index do begin
      outJson.Add, decodeJSONFloatFromHexString(el)
    endforeach
    return, outJson
  endif
  return, inJson
end
 
function encodeJSONFloatToHexString, inJson
  compile_opt idl2
 
  if (ISA(inJson, 'Hash')) then begin
    outJson = Hash()
    foreach val, inJson, key do begin
      outJson[key] = encodeJSONFloatToHexString(val)
    endforeach
    return, outJson
  endif else if (ISA(inJson, 'List')) then begin
    outJson = List()
    foreach el, inJson, index do begin
      outJson.Add, encodeJSONFloatToHexString(el)
    endforeach
    return, outJson
  endif else begin
    if (ISA(inJson, 'Float')) then begin
      dims = ISA(inJson, /SCALAR) ? 1 : Size(inJson, /DIMENSION)
      longs = ULong(inJson, 0, dims)
      hex = String(longs, FORMAT='(Z08)')
      return, Hash('encoding', 'HexStringToFloat', 'hex_string', Reform(hex, dims))
    endif else if (ISA(inJson, 'Double')) then begin
      dims = ISA(inJson, /SCALAR) ? 1 : Size(inJson, /DIMENSION)
      longs = ULong64(inJson, 0, dims)
      hex = String(longs, FORMAT='(Z016)')
      return, Hash('encoding', 'HexStringToDouble', 'hex_string', Reform(hex, dims))
    endif
  endelse
  return, inJson
end

 

The effects of these encode and decode functions are seen here:

IDL> j = List(hash('factory', 'foo', 'value', findgen(3,2)), hash('factory', 'bar', 'val', dindgen(4,3)))
IDL> j
[
    {
        "factory": "foo",
        "value": [[0.00000000, 1.0000000, 2.0000000], [3.0000000, 4.0000000, 5.0000000]]
    },
    {
        "factory": "bar",
        "val": [[0.00000000000000000, 1.0000000000000000, 2.0000000000000000, 3.0000000000000000], [4.0000000000000000, 5.0000000000000000, 6.0000000000000000, 7.0000000000000000], [8.0000000000000000, 9.0000000000000000, 10.000000000000000, 11.000000000000000]]
    }
]
IDL> j2 = encodeJSONFloatToHexString(j)
IDL> j2
[
    {
        "factory": "foo",
        "value": {
            "hex_string": [["00000000", "3F800000", "40000000"], ["40400000", "40800000", "40A00000"]],
            "encoding": "HexStringToFloat"
        }
    },
    {
        "factory": "bar",
        "val": {
            "hex_string": [["0000000000000000", "3FF0000000000000", "4000000000000000", "4008000000000000"], ["4010000000000000", "4014000000000000", "4018000000000000", "401C000000000000"], ["4020000000000000", "4022000000000000", "4024000000000000", "4026000000000000"]],
            "encoding": "HexStringToDouble"
        }
    }
]
IDL> j3 = decodeJSONFloatFromHexString(j2)
IDL> j3
[
    {
        "factory": "foo",
        "value": [[0.00000000, 1.0000000, 2.0000000], [3.0000000, 4.0000000, 5.0000000]]
    },
    {
        "factory": "bar",
        "val": [[0.00000000000000000, 1.0000000000000000, 2.0000000000000000, 3.0000000000000000], [4.0000000000000000, 5.0000000000000000, 6.0000000000000000, 7.0000000000000000], [8.0000000000000000, 9.0000000000000000, 10.000000000000000, 11.000000000000000]]
    }
]
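
The hex strings are also straightforward to handle outside IDL.  The following Python sketch is an illustration under assumptions rather than a counterpart shipped with the IDL code: the four function names are hypothetical, and it relies on the hex digits being the integer bit pattern written most significant digit first, which corresponds to big-endian (`>`) packing regardless of the host byte order.

```python
import struct

def float_to_hex(x: float) -> str:
    """8 hex digits of the IEEE 754 single-precision bit pattern."""
    return struct.pack(">f", x).hex().upper()

def double_to_hex(x: float) -> str:
    """16 hex digits of the IEEE 754 double-precision bit pattern."""
    return struct.pack(">d", x).hex().upper()

def hex_to_float(s: str) -> float:
    """Inverse of float_to_hex."""
    return struct.unpack(">f", bytes.fromhex(s))[0]

def hex_to_double(s: str) -> float:
    """Inverse of double_to_hex."""
    return struct.unpack(">d", bytes.fromhex(s))[0]

print(hex_to_float("40A00000"))           # 5.0
print(hex_to_double("4026000000000000"))  # 11.0
print(float_to_hex(1.0))                  # 3F800000

# A double survives the trip through its hex string bit for bit.
x = 0.1 + 0.2
print(hex_to_double(double_to_hex(x)) == x)  # True
```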

 

The efficiency of this encoding scheme is marginally better than the first, 450 characters for j2 vs 455.  This may not seem like much, but the 450 is very deterministic due to every Float yielding 8 characters and every Double 16.  The 455 characters from the first encoder is almost a best case scenario, since there were lots of 0 bytes which only need 1 character.  On average a Byte should need 2.57 characters (10 one-character values, 90 two-character values, 156 three-character values).  The 139 characters in the original JSON stream is also rather minimalist, since every floating point value was an exact integer, so it only used one digit to the right of the decimal point.
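
The 2.57 figure can be checked with a few lines of Python, assuming every byte value from 0 to 255 is equally likely and is written in decimal with no padding:

```python
# Count how many decimal digits each possible byte value needs.
lengths = [len(str(b)) for b in range(256)]

one = lengths.count(1)    # values 0..9
two = lengths.count(2)    # values 10..99
three = lengths.count(3)  # values 100..255
average = sum(lengths) / 256

print(one, two, three)    # 10 90 156
print(round(average, 2))  # 2.57
```

On that average, and assuming compact output with no spaces, four byte values plus three commas and two brackets come to roughly 15 characters per Float, against 10 for a quoted 8-digit hex string.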

Let’s consider something more real world, such as a 2D array of random numbers:

IDL> orig = randomu(1, 4, 5)
IDL> orig
      0.41702199      0.99718481      0.72032452      0.93255734
   0.00011438108      0.12812445      0.30233258      0.99904054
      0.14675589      0.23608898     0.092338592      0.39658073
      0.18626021      0.38791075      0.34556073      0.66974604
      0.39676747      0.93553907      0.53881675      0.84631091
IDL> byte_array = encodeJSONFloatToByteArray(orig)
IDL> byte_array
{
    "bytes": [[[232, 131, 213, 62], [129, 71, 127, 63], [48, 103, 56, 63], [20, 188, 110, 63]], [[224, 223, 239, 56], [14, 51, 3, 62], [86, 203, 154, 62], [31, 193, 127, 63]], [[45, 71, 22, 62], [79, 193, 113, 62], [4, 28, 189, 61], [161, 12, 203, 62]], [[255, 186, 62, 62], [61, 156, 198, 62], [86, 237, 176, 62], [122, 116, 43, 63]], [[27, 37, 203, 62], [125, 127, 111, 63], [229, 239, 9, 63], [213, 167, 88, 63]]],
    "encoding": "ByteArrayToFloat"
}
IDL> hex_string = encodeJSONFloatToHexString(orig)
IDL> hex_string
{
    "hex_string": [["3ED583E8", "3F7F4781", "3F386730", "3F6EBC14"], ["38EFDFE0", "3E03330E", "3E9ACB56", "3F7FC11F"], ["3E16472D", "3E71C14F", "3DBD1C04", "3ECB0CA1"], ["3E3EBAFF", "3EC69C3D", "3EB0ED56", "3F2B747A"], ["3ECB251B", "3F6F7F7D", "3F09EFE5", "3F58A7D5"]],
    "encoding": "HexStringToFloat"
}
IDL> StrLen(JSON_Serialize(orig)), StrLen(JSON_Serialize(byte_array)), StrLen(JSON_Serialize(hex_string))
         394
         364
         276

 

The advantage of the hex string encoding over the byte array encoding is even larger for Doubles:

IDL> orig = randomu(1, 4, 5, /DOUBLE)
IDL> orig
      0.41702200470257400      0.72032449344215810   0.00011437481734488664      0.30233257263183977
      0.14675589081711304     0.092338594768797799      0.18626021137767090      0.34556072704304774
      0.39676747423066994      0.53881673400335695      0.41919451440329480      0.68521950039675950
      0.20445224973151743      0.87811743639094542     0.027387593197926163      0.67046751017840223
      0.41730480236712697      0.55868982844575166      0.14038693859523377      0.19810148908487879
IDL> byte_array = encodeJSONFloatToByteArray(orig)
IDL> byte_array
{
    "bytes": [[[6, 60, 250, 15, 125, 176, 218, 63], [81, 240, 186, 243, 229, 12, 231, 63], [0, 192, 97, 102, 144, 251, 29, 63], [244, 8, 254, 183, 106, 89, 211, 63]], [[60, 5, 199, 163, 229, 200, 194, 63], [16, 202, 176, 140, 128, 163, 183, 63], [228, 225, 52, 230, 95, 215, 199, 63], [206, 163, 91, 189, 170, 29, 214, 63]], [[232, 251, 123, 103, 163, 100, 217, 63], [84, 159, 98, 151, 252, 61, 225, 63], [138, 149, 129, 58, 21, 212, 218, 63], [39, 35, 25, 114, 81, 237, 229, 63]], [[12, 98, 24, 199, 125, 43, 202, 63], [74, 22, 235, 188, 137, 25, 236, 63], [192, 172, 103, 68, 126, 11, 156, 63], [169, 229, 167, 71, 120, 116, 229, 63]], [[254, 90, 168, 51, 31, 181, 218, 63], [11, 9, 185, 125, 201, 224, 225, 63], [220, 170, 6, 255, 50, 248, 193, 63], [68, 72, 116, 188, 99, 91, 201, 63]]],
    "encoding": "ByteArrayToDouble"
}
IDL> hex_string = encodeJSONFloatToHexString(orig)
IDL> hex_string
{
    "hex_string": [["3FDAB07D0FFA3C06", "3FE70CE5F3BAF051", "3F1DFB906661C000", "3FD3596AB7FE08F4"], ["3FC2C8E5A3C7053C", "3FB7A3808CB0CA10", "3FC7D75FE634E1E4", "3FD61DAABD5BA3CE"], ["3FD964A3677BFBE8", "3FE13DFC97629F54", "3FDAD4153A81958A", "3FE5ED5172192327"], ["3FCA2B7DC718620C", "3FEC1989BCEB164A", "3F9C0B7E4467ACC0", "3FE5747847A7E5A9"], ["3FDAB51F33A85AFE", "3FE1E0C97DB9090B", "3FC1F832FF06AADC", "3FC95B63BC744844"]],
    "encoding": "HexStringToDouble"
}
IDL> StrLen(JSON_Serialize(orig)), StrLen(JSON_Serialize(byte_array)), StrLen(JSON_Serialize(hex_string))
         391
         659
         437

 

The hex string encoding of the Float array took 75.8% of the space of the byte array encoding, while the encoding of the Double array took only 66.3% of the space.
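
These percentages, and the 227% expansion quoted for the first example, follow directly from the StrLen() values printed in the sessions above.  A short Python check, with variable names chosen only for this illustration:

```python
# Serialized lengths reported by StrLen(JSON_Serialize(...)) in the sessions.
float_bytes, float_hex = 364, 276
double_bytes, double_hex = 659, 437

float_ratio = 100 * float_hex / float_bytes
double_ratio = 100 * double_hex / double_bytes
print(f"{float_ratio:.1f}% {double_ratio:.1f}%")  # 75.8% 66.3%

# Growth of the first example: 139 characters plain, 455 as byte arrays.
expansion = 100 * (455 - 139) / 139
print(f"{expansion:.0f}%")  # 227%
```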