I worked on a project that involved working with the MPEG Transport Stream and digital TV (DVB): EPGReader, to be exact.
The MPEG TS is a binary format in which multiple fields can be packed into a single byte. I could not use Python's struct module, because it only works with fields of a byte or larger and the fields I had were just a couple of bits wide. I started with regular bitshifts and bitmasks, but soon realised that this was a very error-prone task for me. It was easy to make mistakes, and the code was not very readable either.
For example, the first 3 bytes of the MPEG TS header begin with the SYNC_BYTE, which is 1 byte in size and always has the value 0x47. This byte is used to detect packets in the stream. Each packet is 188 bytes long. The next bit is the "transport error indicator", which is set by the receiver hardware to flag errors in demodulation (analog signal to bits). The bit after that is the "payload unit start indicator", which indicates that the current packet starts a new payload of data. Then comes the "transport priority" bit and finally the 13-bit "packet id". Normally I'd write code something like the following:
    # Assuming read() returns bytes, convert them to one 24-bit integer first
    data = int.from_bytes(read(3), 'big')  # Get 3 bytes of data
    sync_byte = data >> 16                 # Get bits 24-17
    tei = data >> 15 & 0x1                 # 16th bit
    payl_start = data >> 14 & 0x1          # 15th bit
    tp = data >> 13 & 0x1                  # 14th bit
    pid = data & 0x1FFF                    # Get last 13 bits

On top of that, I needed to store the values in a dictionary. As you can see, this is neither very readable nor convenient, so I figured there must be something easier.
1. Meet BitReader
    spec = (
        # Name of the data to read
        'sync_byte',
        # How many bits to read (8 bits = 1 byte)
        8,
        'tei', 1,
        'payl_start', 1,
        'tp', 1,
        'pid', 13
    )
    reader = BitReader(spec)
    data = reader.read(read(3))
    assert data.sync_byte == 0x47

The bitshift code starts to break as soon as you need to add one more byte and a couple of variables. Any change to the original data size requires you to change the bitshifts accordingly. Adding new values in the middle, for example because you missed them in the spec the first time, also requires changes to the bitshifts.
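    # Assuming read() returns bytes, convert them to one 32-bit integer first
    data = int.from_bytes(read(4), 'big')  # Get 4 bytes of data
    sync_byte = data >> 24                 # Get bits 32-25
    tei = data >> 23 & 0x1                 # 24th bit
    # etc... not very interested in getting this right, but you'll get the idea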
When using BitReader, I just give the variable name and how many bits it takes. Simple as that. No need to touch the other variables.
    spec = (
        # Name of the data to read
        'sync_byte',
        # How many bits to read
        8,
        'tei', 1,
        'payl_start', 1,
        'tp', 1,
        'pid', 13,
        'scrambling', 2,
        'has_adapt', 1,
        'has_payload', 1,
        'continuity', 4
    )
    reader = BitReader(spec)
    data = reader.read(read(4))
And it doesn’t matter if the new values are added to the beginning, middle or at the end.
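To make the idea concrete, here is a minimal sketch of how a spec-driven reader of this kind could work. It is not BitReader's actual implementation: the function name `read_fields` and the use of `SimpleNamespace` for attribute access are assumptions made for illustration, and the sample bytes are made up.

```python
from types import SimpleNamespace


def read_fields(spec, raw):
    """Split `raw` (bytes) into named bit fields.

    `spec` is a flat sequence alternating field name and bit width,
    in the same style as the spec tuples above. Hypothetical helper,
    not part of BitReader.
    """
    names = spec[0::2]
    widths = spec[1::2]
    total_bits = sum(widths)
    if total_bits != len(raw) * 8:
        raise ValueError("spec covers %d bits, data has %d"
                         % (total_bits, len(raw) * 8))

    value = int.from_bytes(raw, "big")  # whole buffer as one big integer
    fields = {}
    remaining = total_bits              # bits not yet consumed, from the left
    for name, width in zip(names, widths):
        remaining -= width
        # Shift the field down to bit 0, then mask off everything above it.
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    return SimpleNamespace(**fields)


spec = ('sync_byte', 8, 'tei', 1, 'payl_start', 1, 'tp', 1, 'pid', 13)
header = read_fields(spec, bytes([0x47, 0x40, 0x11]))  # made-up sample header
```

Because the shift amounts are derived from the widths, inserting or removing a field in the spec never requires touching the other fields.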
2. About performance & syntax
BitReader is a bit slower than using bitshifts, but it was still easily fast enough for the task I worked on. And if compiled using Cython the performance nearly doubles without any code change.
Is it faster? – Performance, no, but you are faster. Have a cup of C if you want speed. Is it more readable? – Yes. Makes life easier? – You bet!
Somebody might look at the specification syntax and quickly note that I could have used a dictionary instead. Unfortunately that was not possible, because the order of the fields matters and, at the time, a plain dictionary did not preserve it (dictionaries only guarantee insertion order from Python 3.7 onwards).
And what about using 2-tuples (variable, bits)? Is it more readable and less error-prone? Not sure, maybe, but I thought I'd save myself from typing the parentheses 🙂
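For comparison, the flat syntax and the 2-tuple syntax carry the same information, and one converts to the other in a single line. The snippet below is only an illustration in plain Python; the variable names are made up and nothing here is part of BitReader.

```python
# Flat spec: name, width, name, width, ...
flat_spec = ('sync_byte', 8, 'tei', 1, 'payl_start', 1, 'tp', 1, 'pid', 13)

# Pair every name (even index) with the width that follows it (odd index).
pairs = list(zip(flat_spec[0::2], flat_spec[1::2]))

# The pairs keep the field order, which is what the parsing depends on.
total_bits = sum(width for _name, width in pairs)
```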
The specification syntax was inspired by domgen… or the other way around. Can’t remember which came first.
You can also convert the data back into binary format. The read method returns a BitData object, which implements a dump() method that returns an array.array('B') containing the bytes. You can change the attributes of the BitData object, dump the data back into an array, and then write it to a file with array.tofile(f) or send it over the network.
A JavaScript port might be interesting for web apps, especially mobile apps, which often have slow connections. And a hand-made C module would probably be at least as fast as the bitshifts in Python.
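The reverse direction can be sketched in the same way. The code below is a hypothetical stand-in, not BitData.dump() itself: `pack_fields` is an invented name, and it takes a plain dictionary of values rather than a BitData object.

```python
import array


def pack_fields(spec, values):
    """Pack named field values back into bytes, following `spec`.

    `spec` alternates field name and bit width; `values` maps each name
    to an integer. Hypothetical helper, not part of BitReader.
    """
    names = spec[0::2]
    widths = spec[1::2]
    total_bits = sum(widths)
    if total_bits % 8:
        raise ValueError("spec must cover a whole number of bytes")

    packed = 0
    for name, width in zip(names, widths):
        field = values[name]
        if not 0 <= field < (1 << width):
            raise ValueError("%s does not fit in %d bits" % (name, width))
        # Make room for the field on the right, then drop it in.
        packed = (packed << width) | field
    return array.array('B', packed.to_bytes(total_bits // 8, "big"))


spec = ('sync_byte', 8, 'tei', 1, 'payl_start', 1, 'tp', 1, 'pid', 13)
values = {'sync_byte': 0x47, 'tei': 0, 'payl_start': 1, 'tp': 0, 'pid': 0x11}
original = pack_fields(spec, values)

values['pid'] = 0x100            # change one field, leave the rest alone
modified = pack_fields(spec, values)
```

The returned array can then be written out with array.tofile(f), as described above.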
3. Project location
Get the code from Bitbucket.
Perhaps you’d also be interested in Construct, an incredible general purpose binary parsing library: http://construct.wikispaces.com/
Is this along the same lines as what hachoir does? http://bitbucket.org/haypo/hachoir/wiki/Home
MPEG_TS is already supported by the parser (at least to some level).
BitReader is a very low-level tool for extracting bit data, and it tries to keep things simple (a JSON-like data specification). Construct seems to have a similar goal, but with more features and a more complex API (data classes). Hachoir seems to do that and a lot more, but it should have a tutorial on how to get started (I couldn't find one).
Interesting projects though. I wish I had known about them sooner 🙂
If I have 10 bytes (in a long?) parsed using BitReader, how do I convert that into a string? Thanks
If you know the location of the string and its length, just read the data from the position and convert the bytes into string type with plain python. BitReader can help you to read the size of the string if it is, for example, located just before the string itself in the data. Like so:
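As a rough sketch of that approach in plain Python: the example assumes a one-byte length prefix immediately followed by ASCII text, which is an assumption for illustration rather than anything defined by BitReader or MPEG TS. The function name `read_prefixed_string` and the sample payload are made up.

```python
def read_prefixed_string(data, offset=0, encoding="ascii"):
    """Read a string stored as <1-byte length><that many bytes of text>.

    Returns the decoded text and the offset just past the string.
    The layout is an assumed example format.
    """
    length = data[offset]        # the size field located just before the string
    start = offset + 1
    end = start + length
    if end > len(data):
        raise ValueError("string runs past the end of the data")
    return data[start:end].decode(encoding), end


payload = bytes([5]) + b"hello" + b"\xff\xff"   # made-up sample data
text, next_offset = read_prefixed_string(payload)
```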
And here is how to convert long to string:
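One way to do this in Python 3 is with int.to_bytes. This sketch assumes the 10 bytes were packed into the integer in big-endian order and that they hold ASCII text; the sample value is made up.

```python
# Build a sample 10-byte integer so the example is self-contained.
value = int.from_bytes(b"0123456789", "big")

# Turn the integer back into its 10 bytes (big-endian, assumed)...
raw = value.to_bytes(10, "big")

# ...and decode those bytes as text (ASCII, assumed).
text = raw.decode("ascii")
```

The length has to be known in advance, because leading zero bytes cannot be recovered from the integer alone.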
Hello kind people,
Can anybody help me? I am trying to read any file (data), whether it is video, audio, text, .exe, HTML or anything else. The problem is that I want bit-level reading, meaning I just want the bits as 1s and 0s, in any language.
Can I do it? If yes, then please tell me.
I will be very thankful to you for this kindness.
my email id is
amanbheley.pcte@gmail.com and
jotbox@rediffmail.com
thanx
Thank you for your feedback. We are not offering consulting services or programming training for free.