A Venture and Adventure into Decompilation of Self-Modifying Code

Research Statement and Proposal
Gregory Morse
www.gmorsecode.com
gregory.morse@live.com

1 Research Statement

Since the advent of modern programming language compilers, whereby a set of human-readable instructions is syntactically and semantically parsed and then translated and optimized into a binary format readable by a machine or an interpreter, there has been a need for the reversal of that process, generally known as decompilation. Yet wide gaps of knowledge remain in decompilation, even though it can be modeled as a process identical to the one performed by a compiler in which only the input and output take on a different appearance. The Von Neumann architecture, on which modern computers are still based, requires that code and data reside in the same memory and operate on the contents of that memory, which yields the possibility for code to modify itself; self-modification is, in effect, a form of compression or obfuscation of the original code. By analyzing self-modifying code, its implications for declarative programming languages, and its temporal behavior, a model for decompilation can be described which generalizes the problem and handles the most complicated situations possible.

2 Past and Future Research

While at Queensland University of Technology, Cristina Cifuentes described in great detail various processes for decompilation, including the structure of the graphs involved and definitions of the structures and elements required during the process. Definitions such as "basic blocks" and algorithms that produce the various C-language control-flow structures from a general graph are foundational elements which can be built upon. A means of restructuring self-modifying code has been attempted by Bertrand Anckaert, Matias Madou, and Koen De Bosschere in "A Model for Self-Modifying Code", yet the ideas there try to separate out the regions which are self-modifying, and specific kinds of code can break those assumptions to the point that a high-level translation can only be rendered by emitting a mathematical description of the entire instruction set alongside the actual data being executed. At times no assumptions can be made at all: if, for example, an external and unavailable library provides input to a routine, even the most advanced mathematical analysis may not be able to simplify certain self-modifying code beyond such an instruction-set description in code. Constraints would then need to be supplied, whether by hand or through detailed analysis of the external components. Constraints can mathematically reduce the problem or yield complete code-restructuring possibilities, and they are a crucial subject in generalizing decompilation. Other research efforts in the field concern incremental and fully dynamic algorithms for properties of directed graphs, including loop nesting forests, dominator trees, and topological ordering, all of which remain open topics. The subject will come up time and again, as it has practical applications as simple as source code recovery or as obscure as validation of code through self-checksums. It can be used as an optimization tool or as a means of obfuscation, sometimes by those protecting their software and at other times by malicious software writers as a means of avoiding detection.
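To make the foundational definitions concrete, here is a minimal Python sketch of basic-block partitioning: leaders are collected (the entry address, every branch target, and every fall-through successor of a branch) and the listing is cut at each leader. The tuple-based instruction format and the mnemonics are invented for illustration and are not taken from the cited work.

```python
# Minimal sketch: partition a linear instruction listing into basic blocks.
# Hypothetical instruction format: (address, mnemonic, branch_target_or_None).
# A "leader" starts a basic block: the entry address, any branch target, and
# any instruction following a branch (call/ret also end a block in this sketch).

def find_leaders(instrs):
    addrs = [a for a, _, _ in instrs]
    leaders = {addrs[0]}
    for i, (addr, op, target) in enumerate(instrs):
        if op in ("jmp", "jcc", "call", "ret"):
            if target is not None:
                leaders.add(target)            # explicit branch target
            if i + 1 < len(instrs):
                leaders.add(addrs[i + 1])      # fall-through successor
    return leaders

def basic_blocks(instrs):
    leaders = find_leaders(instrs)
    blocks, current = [], []
    for instr in instrs:
        if instr[0] in leaders and current:    # cut at every leader
            blocks.append(current)
            current = []
        current.append(instr)
    if current:
        blocks.append(current)
    return blocks

if __name__ == "__main__":
    program = [
        (0, "mov", None), (1, "cmp", None), (2, "jcc", 5),
        (3, "add", None), (4, "jmp", 6), (5, "sub", None), (6, "ret", None),
    ]
    for block in basic_blocks(program):
        print([addr for addr, _, _ in block])   # [0,1,2] [3,4] [5] [6]
```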
3 Dynamic Decompilation

The idea behind this proposal is to create a decompilation algorithm general enough that every other algorithm to date is merely a simplified subset of it. Incremental and fully dynamic algorithms, although not strictly required, must be highlighted for the efficiency they bring in eliminating static-pass analysis and moving towards a one-pass, no-assumption algorithm. Self-modifying code will be handled even in the absolute worst-case scenarios where no determinations or optimizations can be made, and wherever significant optimization is possible, a temporal analysis algorithm will be applied to achieve optimal code structuring and data-flow optimization expressible in a high-level language. In the worst case, a mathematical description of the processor instruction set, or a partial description if some simplification is possible, would appear in the output.

Complexity analysis for self-modifying code

By temporally analyzing self-modifying code fragments and their interactions with each other, a complexity measure can be determined which can serve as a useful indicator for automated scanning or as a theoretical research topic in itself. Where no constraints are present, unbounded complexity on the order of the complexity of the processor instruction set itself must be taken into consideration. Given the extraordinary facilities on board a modern processor chip, with multiple stages, multiple cores, pipelines, caching, predictive branching, non-uniform clock-cycle counts and other considerations, determining the complexity of a modern processor is a research field in its own right, as simplification generally requires context. Furthermore, parallelism matters here: whether execution occurs on multiple cores or threads or along a single atomic pathway changes the implications of self-modifying code, which in certain cases could yield strange race conditions and very complex resulting behavior.

Research Highlights

Mathematical descriptions of processor instruction sets

The use of a pseudo-code, high-level description of the entire processor instruction set would allow, in the most naïve sense, a generated program that simply defines the code being decompiled as a data-set input to this processor-emulator loop. Equivalence and compilability are maintained, yet the efficiency would be called into question. Given that high-level languages often have no way to express self-modifying code, a special compiler would be needed to translate such code back to its original binary form for the sake of optimization.
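A minimal sketch of this worst-case fallback follows, assuming a tiny invented accumulator machine (the opcodes and memory layout are hypothetical, not any real instruction set). The "decompiled" artifact is simply the original bytes kept as data plus an interpreter loop implementing the instruction-set description; because code and data share one array, a store that lands in the code region needs no special handling.

```python
# Worst-case "decompilation" output, sketched: the recovered program is just the
# original bytes kept as data, plus an interpreter for a hypothetical ISA.
CODE = bytearray([
    0x01, 7,    # LOAD  #7       acc = 7
    0x02, 5,    # ADD   #5       acc += 5
    0x03, 3,    # STORE 3        mem[3] = acc  (a write into the code region)
    0x04,       # HALT
])

def run(mem):
    """Interpreter loop: code and data live in one mutable array, so
    self-modifying stores need no special machinery."""
    acc, pc = 0, 0
    while True:
        op = mem[pc]
        if op == 0x01:                      # LOAD immediate
            acc = mem[pc + 1]; pc += 2
        elif op == 0x02:                    # ADD immediate
            acc += mem[pc + 1]; pc += 2
        elif op == 0x03:                    # STORE to address (may hit code)
            mem[mem[pc + 1]] = acc & 0xFF; pc += 2
        elif op == 0x04:                    # HALT: result in accumulator
            return acc
        else:
            raise ValueError(f"undefined opcode {op:#x} at {pc}")

if __name__ == "__main__":
    print(run(bytearray(CODE)))   # -> 12; the STORE rewrote the ADD operand at mem[3]
```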
Temporal analysis of code which is stored as data

A novel algorithm is proposed which tracks self-modifying code by treating it much like loop cycles: the modified code is modeled parametrically as a temporal function, so that simplification or transformation can be carried out through a system of parametric equations, including partial derivatives with respect to the various time parameters, given that there could be any number of independent time variables depending on the complexity of the algorithm employing the self-modifying code.

Uniformity of compilation and decompilation by merging generality

Owing to the difficulty of decompilation and the difference in expressivity between machine code and high-level source code, there has been little attempt at combining the two processes into a single procedure that goes both ways. Yet the principles of compilation are fundamentally tied to those of decompilation, given that each is merely an optional verification followed by a translation and optimization process in one direction or the other. Merging them could yield better compilers with more generalized structuring and optimization algorithms, as well as better test coverage for the resulting tool.

The necessary reduction of overhead through incremental or fully dynamic graph algorithms

Decompilation cannot rely on syntax to divide the work accurately into multiple stages or "passes" the way a compiler can thanks to the stringent rules of high-level languages. Instead, the entire decompiled graph, ready to be translated into any other form, should be maintained incrementally as the code is analyzed, so that no part of the code is ever analyzed more than once and no assumptions are ever made. Static code-flow analysis makes a great number of assumptions beyond merely ignoring self-modification, including the reachability of code which may not, logically speaking, be reachable. Incremental analysis should be coupled with incremental or even fully dynamic algorithms which handle the deletion of edges from a graph where appropriate, so that topological orders, dominator trees and other important connectivity structures can be maintained efficiently as the code and data flow graphs grow and divide, while the many properties needed for structuring, simplification and analysis are preserved with certainty throughout the decompilation process (a small sketch of one such maintained structure follows the next subsection).

Heuristical approaches to structuring code functions

The idea of functions, or reusable units of code, is one that must be defined by heuristics, since it is an arbitrary distinction often based on the stack; given the popular optimization of inlining functions, it is one that requires further heuristic analysis to handle properly and efficiently. It is of course an absolute requirement of a decompiler, since recursion would otherwise yield an infinitely large source code output, yet one that, if applied too aggressively, might make the output more confusing and less readable. What heuristic tools can be used to allow various user-defined levels of source code optimization is worth analyzing, as function definition may be the most arbitrary distinction in the entire process.
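Returning to the incremental maintenance of graph properties described above, the sketch below keeps a topological order of a directed acyclic graph valid under edge insertions without re-sorting the whole graph, in the spirit of the Pearce-Kelly algorithm. It is a simplified illustration under the assumption of a DAG, not a production dynamic-graph library; dominator-tree maintenance would require further machinery.

```python
# Sketch of one dynamically maintained structure: a topological order kept
# valid under edge insertions (deletions are trivial here, since removing an
# edge never invalidates an existing topological order).

class DynamicTopoOrder:
    def __init__(self, nodes):
        self.succ = {n: set() for n in nodes}
        self.pred = {n: set() for n in nodes}
        self.ord = {n: i for i, n in enumerate(nodes)}   # node -> position

    def add_edge(self, u, v):
        lo, hi = self.ord[v], self.ord[u]
        if lo <= hi:                      # new edge contradicts current order
            fwd = self._reach(v, self.succ, lambda n: self.ord[n] <= hi)
            if u in fwd:
                raise ValueError(f"edge ({u!r}, {v!r}) would create a cycle")
            back = self._reach(u, self.pred, lambda n: self.ord[n] >= lo)
            self._reorder(back, fwd)
        self.succ[u].add(v)
        self.pred[v].add(u)

    @staticmethod
    def _reach(start, adj, in_region):
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in seen or not in_region(n):
                continue
            seen.add(n)
            stack.extend(adj[n])
        return seen

    def _reorder(self, back, fwd):
        # Reuse the affected positions: back-set nodes first, then forward-set
        # nodes, each keeping their previous relative order.
        region = sorted(back | fwd, key=self.ord.get)
        slots = sorted(self.ord[n] for n in region)
        new_seq = [n for n in region if n in back] + [n for n in region if n in fwd]
        for n, slot in zip(new_seq, slots):
            self.ord[n] = slot

if __name__ == "__main__":
    t = DynamicTopoOrder(["a", "b", "c", "d"])
    t.add_edge("c", "b")
    t.add_edge("b", "a")
    print(sorted(t.ord, key=t.ord.get))   # ['c', 'b', 'a', 'd']
```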
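For the function-structuring question just raised, one of the simplest heuristics is sketched below: treat every direct call target as a function entry and attribute each instruction to the closest preceding entry. Real decompilers layer many further heuristics on top of this (prologue patterns, stack discipline, inlining detection); the instruction format is the same hypothetical one used in the earlier sketches.

```python
# Toy heuristic for carving "functions" out of a flat address space.
import bisect

def function_entries(instrs, start):
    entries = {start}                       # program entry point
    for _, op, target in instrs:
        if op == "call" and target is not None:
            entries.add(target)             # every direct call target
    return sorted(entries)

def assign_to_functions(instrs, start=0):
    entries = function_entries(instrs, start)
    funcs = {e: [] for e in entries}
    for addr, _, _ in instrs:
        owner = entries[bisect.bisect_right(entries, addr) - 1]
        funcs[owner].append(addr)           # closest preceding entry owns it
    return funcs

if __name__ == "__main__":
    prog = [
        (0, "call", 4), (1, "call", 6), (2, "ret", None), (3, "nop", None),
        (4, "add", None), (5, "ret", None), (6, "mov", None), (7, "ret", None),
    ]
    print(assign_to_functions(prog))   # {0: [0, 1, 2, 3], 4: [4, 5], 6: [6, 7]}
```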
4 Motivations for Future Research

Until readily available decompilers that produce compilable and accurate code exist, this area will remain an active research topic. Theoretical assessments of the problem must be well understood at a practically implementable level before decompilers become abundant on the market. The prevalence and rising use of interpreted languages, which permit certain important reductions through various assumptions, has reduced interest in the more general Von Neumann problem. Yet the problem remains a valid one, given that self-modifying code has implications for source code recovery, security, malicious software, compression, obfuscation and other areas which software engineers will continue to regard as critical to their profession. The topic remains of interest to ACM Transactions on Programming Languages and Systems (TOPLAS), IEEE Transactions on Computers and various conferences and journals on computing theory.

Some future applications are:

Design of high-level languages which make productive use of self-modifying code

No programming languages are designed around making use of self-modifying code for security, integrity, compression and the other unique features it could offer. This is in part because self-modification depends on the instruction set, while high-level languages are by definition processor-independent. Yet optimization is highly processor-dependent, and self-modifying code could be used to characterize aspects of a processor that are not normally considered.

Translation between high-level languages

Given the abundance of high-level languages on any given platform, there is constant interest in supporting more languages or moving between them with relative ease and simplicity, as well as in tasks like changing the bit size, where the code is equivalent but the processor uses a different data and/or address bus width.

Translation between machine languages

Often there are situations, especially with legacy products, where code developed for one processor must be run in another environment. If there is no source code, performing strict binary translation becomes an option, and it is more efficient than the overhead of using an interpreter, since one interpretation pass is enough to produce an equivalent set of binary instructions. Going back to source code is not necessary, but the challenges involved largely overlap with those of decompilation.

Finding new uses of self-modifying code

If self-modifying code were more maintainable, better understood and practical, much new interest in development in that area could resume, potentially unlocking more efficient and clever methods of programming. Processor manufacturers could also find new ways of architecting their instruction sets and chips to take advantage of self-modifying code programming patterns, potentially reducing clock times, allowing different parallel programming patterns, and increasing the efficiency of caching and predictive pathways. Processor manufacturers typically chase Moore's law, improving chip performance through the reduction of transistor size, yet processors designed around self-modifying code could allow groundbreaking reductions in the lengths of pathways for various operations. The instruction set itself could become self-modifying in the same spirit if more were understood in this area, which could potentially create a very secure and protected computing environment or enable a very significant content-management control system, as an example.