The goal of scene understanding is to recognize what objects in the image and where the objects are located. Hierarchical structure is commonly found in the natural scene images. This structure not only can help us to identity the objects but also how the small units interact to form the whole objects. Our algorithm is based on the level structure. We merge the neighboring segments continuously until they combined into the whole object. The result is a forest which contains several trees, one tree commonly represents one object. We introduce a machine learning model to describe the merge process, greedy inference to compute the best merge trees, and the max margin to learn the parameters. We cluster the segments features to initialize the parameter. The experiment result could be accepted.