A 94x performance boost will not come from just writing some assembly instructions. It will come from fixing terrible memory access patterns, not allocating memory in a hot loop, and using SIMD instructions. At 94x there could be some algorithmic changes too, like skipping redundant calculations when the result computed for some pixels can be reused, with its order reversed, for other pixels.
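To make the first two points concrete, here is a rough sketch in C. It is not taken from the code under discussion: the function names `brighten_slow` and `brighten_fast` and the row-major 8-bit image layout are assumptions made up for illustration. Both functions compute the same saturating brightness adjustment. The first walks the image column by column and allocates a scratch buffer on every iteration; the second makes one contiguous pass with no allocation, which is also the loop shape compilers can often auto-vectorize into SIMD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Saturating add of `delta` to one 8-bit pixel. */
static uint8_t add_sat(uint8_t p, uint8_t delta) {
    unsigned s = (unsigned)p + delta;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Hypothetical "before" version, slow on purpose.
 * The image is assumed to be stored row-major, but the outer loop runs over
 * columns, so consecutive accesses are `width` bytes apart. It also calls
 * malloc/free once per column inside the hot loop.
 * Returns 0 on success, -1 if an allocation fails (the image may then be
 * partially updated). */
int brighten_slow(uint8_t *img, size_t width, size_t height, uint8_t delta) {
    assert(img != NULL);
    if (height == 0) return 0;
    for (size_t x = 0; x < width; x++) {
        uint8_t *col = malloc(height);      /* allocation in the hot loop */
        if (col == NULL) return -1;
        for (size_t y = 0; y < height; y++)
            col[y] = add_sat(img[y * width + x], delta);   /* strided read */
        for (size_t y = 0; y < height; y++)
            img[y * width + x] = col[y];                   /* strided write */
        free(col);
    }
    return 0;
}

/* Hypothetical "after" version.
 * One linear pass over contiguous memory, updated in place, no allocation.
 * Each pixel is independent, so an optimizing compiler may turn this loop
 * into SIMD code on its own; hand-written intrinsics or assembly would
 * start from the same structure. */
void brighten_fast(uint8_t *img, size_t width, size_t height, uint8_t delta) {
    assert(img != NULL);
    size_t n = width * height;
    for (size_t i = 0; i < n; i++)
        img[i] = add_sat(img[i], delta);
}
```

How much faster the second version runs depends on image size, cache sizes, compiler and flags, so treat it as an illustration of the kind of change involved rather than a measurement.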