Monday, June 10, 2013

How Computer Architecture and C++ can help a Java developer to improve in performance

Recently I came across an interesting example of how the knowledge of C/C++ and Computer Architecture can help a Java developer to code better performing code and want to share with you all.

Lets take an example of Matrix multiplication of two 1000x1000 matrices in two ways: (Download the source code from: https://docs.google.com/file/d/0BxxKTIP8UwuMbXpqb3JJYVN6Rm8/edit?usp=sharing)
1. The traditional logic   ( usualMatrixMul() function has this logic in the code shared )
This is the traditional way, which was taught to us in schools. We traverse matrix 1 horizontally and matrix 2 vertically, multiplying the corresponding elements and calculating partial sums and adding all partial sums at last to find the resulting matrix's element.

2. Slightly modified logic ( modifiedMatrixMul() function has this logic in the code shared )
 In this slightly modified algorithm, we first transpose martix 2, and do the same traditional logic, but catered to the transposed matrix 2.

A demo run of these two algorithms took the following time to run: (x86 processor, Win32)
Usual matrix multiplication algorithm takes 24797 ms
Modified matrix multiplication algorithm takes 8500 ms


How this small change in logic can reduce the time of run by about three times?
                                            Well, a little bit of knowledge about C/C++ and Computer Architecture can help you understand this.


1. From C/C++: Multidimensional arrays are stored linear in the memory
JVM is itself implemented in C/C++. Under the hood, Java arrays are mapped to C/C++ arrays as the dynamic memory allocation logic for Java arrays are . In C/C++ arrays, be in single/multi dimension/s, are stored linear in memory.
For instance,
Array X in memory is stored as: X11 X12 X21 X22
Array Y in memory is stored as Y11 Y12 Y21 Y22


2. From Computer Architecture: Processor Cache caches chunks of linear data from memory to Cache
When your CPU needs some data from the memory, it first consults the cache, if it has the data. If the cache has the data from expected memory address (called Cache Hit), it immediately gives the data to the CPU and CPU proceeds with its work. But if the cache haven't cached the memory address, there is a Cache Miss and the CPU then goes to main memory to fetch the block to cache and proceeds with its execution.

 Cache accessing time is in nanoseconds, where was accessing main memory (RAM) will take long time in comparison.

How these help in the slightly modified algorithm?
                                               When the array is very large, parts of the array are cached and when there is a miss on some other part of array, then the subsequent part of the array is cached linearly.

In case of matrix multiplication by the traditional logic, matrix X is cached linearly and access linearly, so Cache misses are less. But, for matrix Y, the cache is like Y11 Y12 Y21 Y22, but is accessed like Y11 Y21 Y12 Y22.

For large data, like 1000x1000 matrix, the cache miss for matrix Y will be huge and everytime, the CPU has to bring data from main memory which is comparatively a time expensive operation and as said earlier, cache works like caching a chunk of data in increasing linear way of addressable memory.

Hence, tweaking the logic little bit and access matrix Y linearly reduces cache misses and shows improvement in performance of time.


Download source code at: https://docs.google.com/file/d/0BxxKTIP8UwuMbXpqb3JJYVN6Rm8/edit?usp=sharing

This performance improvement is based on caching and is specific to the traditional algorithm and is true for any programming language, Java/C/C++. Just java developers need to know additionally the way java arrays are handled by JVM.