Pulp Fiction:
This post will cover the basics of 'Garbage Collection' which is a part of Java's automated memory management. The matter presented here has been compiled from Oracle/Sun's documentations and various posts available freely on the internet. I have tried to present things here in a simplified manner and have willing left out some intrinsic details. I suggest after going through this post you should, atleast once, hear the story straight from the horse's mouth which can found here
Life is beautiful
One of the major reason I love Java is because of its automated memory management feature. As opposed to languages that do not have an automated memory management system in place, Java relieves the programmer from worrying about freeing allocated memory. This enables the programmer to concentrate on implementing the business logic rather than tracing the causes of those pesky memory leaks. Also, as the programmer is not allowed to fiddle directly with the memory blocks, he cannot accidently (or purposely) crash the system by incorrectly freeing allocated memory. This restriction ensures program integrity. Apart from this, some Garbage collectors also help combat heap fragmentation (not covered in this post).
Pitfalls
One point to note here is that Garbage Collection activities consume the very same system resources that are meant to run the application. This may slow down the application or make it appear unresponsive when garbage collection is in progress. Also, since its automated, we do not have much say in regards to when it should take place. Fortunately, Garbage Collection algorithms have vastly improved overtime and hence slow/unresponsive behavior, if any, can largely be taken care of by appropriate heap space tuning and making use of the Garbage Collector which best suits our needs (more on this later in the post).
Some Definitions
Starting with basic definitions, objects that are no longer needed are called Garbage and the process of collecting and throwing away garbage to make space for new objects is called Garbage Collection. Process that does Garbage Collection is called Garbage Collector.
When does a non-garbage object become a garbage object?
The answer to the above question lies in the following four scenarios:
1) All references of that object is explicitly set to null (e.g. obj = null)
2) Object is created inside a block and reference goes out of scope
3) Parent object is set to null
4) If an object has only live references via WeakHashMap
How to identify garbage objects?
Naive Approach
The naive algorithm is to iterate over every reachable object and use a flag to mark the liveliness of that object. Objects that are left over are considered garbage and removed. This is a straight forward approach, but time taken to finish each cycle of garbage collection is proportional to the number of live objects. Hence, this approach will not scale up in applications that maintain a large set of live data. Can we do any better?
Generational Collection
It has been empirically observed in many programs that that most recently created objects are also those most likely to become unreachable(garbage) quickly. This means that objects have a very high infant mortality. Using this empirical observation, memory is managed in Java in generations i.e. objects of different ages are stored in different memory pools.
For simplicity sake, the memory here is divided into 2 parts. First part is called Young Generation (YG) and it holds newly created objects. Second part is called Tenured Generation (TG) and it holds objects that have been around for a while.
All the newly created objects are placed in the YG. When the YG becomes full, garbage collection takes place in YG. Garbage collection in YG is called Minor Collection. So, during minor collection all garbage objects in YG are removed. The live objects left in YG are then moved to the TG if they are old enough to be in TG else they remain in YG. This process goes on till TG gets filled, upon which Garbage collection in takes place in TG. This is called Major Collection.
If minor collection takes place, but there is not sufficient space in TG to hold the tenured objects, Full Garbage Collection (FullGC) takes place which is simply minor+major collection.
A bit more about Minor Collection
Now, lets see in detail how minor collection (garbage collection in YG) works. The YG is divided into 3 parts - Eden, Survior1 (S1) and Survior2 (S2). All newly created objects are placed in Eden. Once Eden gets filled, minor collection takes place and garbage objects in Eden are removed. Remaining objects are moved to S1. Now, when the Eden gets filled again, minor collection takes place and all garbage objects in Eden and S1 are removed. Remaining objects from both the pools are moved to S2. Similarly, in the next minor collection live objects in Eden and S2 pools are again moved back to S1. Hence, objects keep oscillating between Survior1 and Survior2 till they come of age. Once they do, they are moved to TG in the next minor collection. Important point to note here is that at any point of time, atleast one of S1 or S2 will always be empty.
Metrics for Performance
There are different algorithms for garbage collection. Their efficacy is measured on the following criteria:
1) Throughput - Percentage of total time not spent in garbage collection
2) Pauses - Times when an application appears unresponsive because garbage collection is happening
3) Footprint - Working set of a process measured in pages and cache lines
4) Promptness - Time between when an object becomes dead and when the memory becomes available.
The desired qualities of a good Garbage Collection algorithm would be high Throughput and Promptness and low Pauses and Footprint. But depending on the use-case, one factor may be traded for the other to achieve desired output. Eg.
Batch Processing Jobs: Aim is to finish the job in the least possible time even if it requires long pauses to achieve that. So here high throughput is given more preferences than low pauses.
Real Time Processing: Here pauses should low even if overall throughput is not at its optimum best.
Now, lets see if size of different generations can affect the above mentioned metrics (we can set the size of generations using the command line parameter -XX:NewRatio. For example, -XX:NewRatio=3 means YG:TG = 1:3 i.e. size of YG is a quarter of the total heap size).
Size of 'Young Generation' is large:
If size of YG is large, then frequency of minor collections will be less. This will hence increase the throughput but at the expense of footprint, promptness and pause times. Also, if the heap is of fixed size, then a large YG implies a smaller TG. And a smaller TG would lead to an increase in the frequency of major collections.
Size of 'Young Generation' is small:
A small YG will minimize pauses at the expense of throughput.
From the above analysis we can infer that there is no one right way to size the generations! The best possible bet is to let JVM take care of the size or you play around with different generation sizes and see what suits your application the best.
Now that we have covered the basics of Garbage Collectors, we will now briefly look into 3 types of collectors that are shipped with JVM. But before that we need to understand one more important feature called Young Generation Guarantee.
Young Generation Guarantee
Young Generation Guarantee means that TG must have enough free memory to accommodate all the live objects. Let me illustrate this with the help of a simple example. Suppose we run an application that creates objects none of which ever become garbage. After the Eden space is filled, minor collection will oscillate the objects between S1 and S2 till they come of age and are eventually moved to TG. Hence, TG must be large enough to accommodate all the live objects, which means TG must be atleast as large as the combined space occupied by Eden and the larger of the two survivor spaces. Garbage Collectors which provide a Young Generation Guarantee are called Young Generation Collectors.
Different Types of Garbage Collectors
1) Serial Collector
This garbage collector provides Young Generation Guarantee and is hence a Young Generation Collector. Prior to J2SE 5.0, JVM used serial collector by default for both minor and major collections. For the later versions, JVM became smart enough to choose the collectors based on the class of machine on which the application has been started.
2) Throughput Collector
In Throughput Collector, minor collections happens using a parallel version of a Young Generation Collector while the major collection happens using a Serial Collector.
Throughput Collector is a good choice for an application that is running on a large no. of processors since minor collection is done by multiple threads.
3) Concurrent Low Pause Collector
In Concurrent Low Pause Collector, minor collection happens using a parallel version of a Young Generation Copying Collector while most of the major collection happens concurrently with the execution of the application. So, if an application is using 10 processors, during major collection the application can continue using 9 processors, while the remaining one processor can be used for major collection. The point to note here is that Concurrent Low Pause Collector can be used only if application can share processor resources with the garbage collector when the application is running.
Since garbage collection takes place concurrently with the application, Concurrent Low Pause Collector is a good choice for applications that can benefit from shorter garbage collector pauses. e.g. Interactive applications. Also, it makes for a good choice when the application has a relatively large set of long-lived data and is running on a machine with multiple processors.
There are quite a few other Garbage Collector implementations details of which are beyond the scope of this article.
References:
- Tuning Garbage Collection with the 5.0 Java[tm] Virtual Machine
- Java's garbage-collected heap
- How Garbage Collection works in Java
- Taming the Java Garbage Collector
- Java Garbage Collection Distilled