Serialization and de-serialization in Python (pickle)

When dealing with tasks that take a long time to process, streaming data, and so on, serialization and de-serialization come in handy. Recently, when applying deep learning to the MNIST dataset on a laptop, this turned out to be a very useful operation.

What is serialization?

In the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection link) and reconstructed later (possibly in a different computer environment). The opposite process is called deserialization (also called unmarshalling).

In Python, this can be easily implemented using the pickle module.
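
As a minimal sketch of that round trip, the snippet below serializes an object to bytes in memory and reconstructs it. The `record` dictionary and its contents are made up for illustration; only `pickle.dumps` and `pickle.loads` come from the standard library.

```python
import pickle

# Hypothetical sample data; any picklable Python object would do.
record = {"epoch": 3, "losses": [0.9, 0.5, 0.3], "tag": ("mnist", "mlp")}

# Serialize the object to an in-memory bytes object.
payload = pickle.dumps(record)

# Deserialize the bytes back into a new object.
restored = pickle.loads(payload)

print(type(payload).__name__)  # bytes
print(restored == record)      # True: same contents
print(restored is record)      # False: a separate copy
```

The restored object is an equal copy, not the original object itself.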

When to use Pickle?

Here are some common uses for this process:

1) saving a program's state data to disk so that it can carry on where it left off when restarted (persistence)
2) sending Python data over a TCP connection in a multi-core or distributed system (marshalling)
3) storing Python objects in a database
4) converting an arbitrary Python object to a string so that it can be used as a dictionary key (e.g. for caching and memoization).
There are some issues with the last one: two identical objects can be pickled and result in different strings, and even the same object pickled twice can have different representations. This is because the pickled output can depend on reference count information.
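
The caveat about dictionary keys can be reproduced with a different, easier-to-trigger cause than reference counts: two equal dictionaries built in a different key order. This is only a sketch; `stable_key` is a hypothetical helper, not part of pickle, and it assumes the dictionary keys are sortable.

```python
import pickle

# Two dictionaries with the same contents, built in a different key order.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

same_value = (a == b)                              # equal as Python objects
same_bytes = (pickle.dumps(a) == pickle.dumps(b))  # pickles follow insertion order

def stable_key(d):
    """Hypothetical helper: pickle the items in sorted order for a repeatable key."""
    return pickle.dumps(sorted(d.items()))

print(same_value)                       # True
print(same_bytes)                       # False
print(stable_key(a) == stable_key(b))   # True
```

Normalizing the object before pickling, as `stable_key` does, is one way to get a repeatable cache key for simple cases like this.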

How to use Pickle?

Saving:

import pickle

# myStuff is whatever object you want to persist
with open('save.p', 'wb') as f:
    pickle.dump(myStuff, f)

Loading:

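from collections import defaultdict

try:
    with open('save.p', 'rb') as f:
        myStuff = pickle.load(f)
except (OSError, EOFError, pickle.UnpicklingError):
    # No usable save file: start from an empty state
    myStuff = defaultdict(dict)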

Alternatively (note that this does not close the file explicitly):
myStuff = pickle.load(open('save.p', 'rb'))

Please note that the argument 'rb' (read, binary) is necessary when loading the pickled data, just as 'wb' is when saving it.
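
Putting saving and loading together, here is a small sketch of the persistence pattern. The helper names `save_state` and `load_state`, the temporary directory, and the stored values are all assumptions made for this illustration.

```python
import os
import pickle
import tempfile
from collections import defaultdict

def save_state(obj, path):
    """Write obj to path in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_state(path):
    """Read the saved object, or start fresh if the file does not exist."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return defaultdict(dict)

# Use a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "save.p")

first = load_state(path)           # no file yet, so an empty defaultdict
first["run1"]["accuracy"] = 0.97   # made-up value
save_state(first, path)

second = load_state(path)          # the state comes back from disk
print(second["run1"]["accuracy"])  # 0.97
```

Catching only `FileNotFoundError` here is a design choice: a corrupted save file still raises an error instead of being silently replaced by an empty state.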

Alternative methods

The dill package can pickle many objects that the standard pickle module cannot. Link: http://nbviewer.jupyter.org/gist/minrk/5241793
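
To see why an alternative can be useful, the sketch below shows a case the standard pickle module refuses: a lambda. It uses only the standard library; dill is a third-party package, so it is left out here, but its `dill.dumps` and `dill.loads` functions are the counterparts that would be tried in place of the pickle call. The `square` name is made up for the example.

```python
import pickle

# An anonymous function: pickle stores functions by name, and a lambda has
# no importable name to look up.
square = lambda x: x * x

try:
    pickle.dumps(square)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    # The exact exception type depends on the Python version and on where
    # the lambda is defined.
    lambda_pickled = False
    print("pickle could not serialize the lambda:", type(exc).__name__)
```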