Friday, March 31, 2006

Make sure to use the Icon Cache

It's sad when people put great effort into optimizing the memory usage of a program, only to have the optimization go unused. It seems that, due to a packaging bug, Ubuntu isn't taking advantage of GTK's icon cache, wasting 300 kb per process.

I'm not sure what we can do to make sure every distro takes the actions necessary to get the best performance out of GNOME; I guess there's always the process of filing bugs.

Thursday, March 30, 2006

Zenity Memory Usage: Valgrind

So, on the latest Dapper beta, I tried running valgrind on an instance of Zenity, a fairly good example of a minimal GTK program. I noticed that memory was being allocated for icon data:

==24632== 74,477 bytes in 2,864 blocks are still reachable in loss record 4,829 of 4,830
==24632==    at 0x401C422: malloc (vg_replace_malloc.c:149)
==24632==    by 0x479FF36: g_malloc (in /usr/lib/
==24632==    by 0x47AFC65: g_strdup (in /usr/lib/
==24632==    by 0x42BAD04: ??? (gtkicontheme.c:2161)
==24632==    by 0x42BAF11: ??? (gtkicontheme.c:1057)
==24632==    by 0x42BBC5F: gtk_icon_theme_lookup_icon (gtkicontheme.c:1244)
==24632==    by 0x42BC108: gtk_icon_theme_load_icon (gtkicontheme.c:1388)
==24632==    by 0x42B74C1: gtk_icon_set_render_icon (gtkiconfactory.c:1748)
==24632==    by 0x43D0CAB: gtk_widget_render_icon (gtkwidget.c:5337)

There were other stack traces here, accounting for over 200 kb of memory. I tried regenerating my icon cache, but this had no effect; I'll have to see what is going on. This stack trace was also very curious:

==24632== 337,656 bytes in 94 blocks are still reachable in loss record 4,830 of 4,830
==24632==    at 0x401C422: malloc (vg_replace_malloc.c:149)
==24632==    by 0x494FC0C: (within /usr/lib/
==24632==    by 0x41B0E9A: (within /usr/lib/
==24632==    by 0x41B6974: (within /usr/lib/
==24632==    by 0x41B237A: (within /usr/lib/
==24632==    by 0x41B4DA1: (within /usr/lib/
==24632==    by 0x41B02D1: pango_ot_info_get_gpos (in /usr/lib/
==24632==    by 0x41B0357: (within /usr/lib/
==24632==    by 0x41B0407: pango_ot_info_find_script (in /usr/lib/
==24632==    by 0x4C162E9: (within /usr/lib/pango/1.5.0/modules/
==24632==    by 0x45E4D12: (within /usr/lib/
==24632==    by 0x45F3FB3: pango_shape (in /usr/lib/
==24632==    by 0x45E7BCE: (within /usr/lib/
==24632==    by 0x45EAD1A: (within /usr/lib/
==24632==    by 0x45EB274: (within /usr/lib/
==24632==    by 0x45EBBCB: (within /usr/lib/
==24632==    by 0x42DA876: ??? (gtklabel.c:2027)

I wonder what exactly it is here that needs 300 kb of memory.

Wednesday, March 22, 2006

libaudiofile patch

Jason Allen commented on my last blog with a patch to fix libaudiofile. This simple patch just adds const to lots of data tables. It trims the 92 kb of dirty private rss down to 8 kb, saving just under 2 MB desktop-wide! Jason: this patch is great. Let's get it upstream, and also try to get the next round of distros (SUSE 10.1, Dapper, FC5) to include it.

There are quite a few other libraries that are used by almost every GNOME process that could benefit from such constification. Some I saw from gnome-terminal:

      48 kb        0 kb       48 kb   /usr/lib/
      40 kb        0 kb       40 kb   /usr/lib/
      36 kb        0 kb       36 kb   /usr/lib/
      36 kb        0 kb       28 kb   /usr/lib/
      24 kb        0 kb       24 kb   /usr/lib/
      20 kb        0 kb       20 kb   /usr/lib/
      20 kb        0 kb       20 kb   /usr/lib/
      20 kb        0 kb       20 kb   /usr/lib/
      20 kb        0 kb       16 kb   /usr/lib/
      16 kb        0 kb       16 kb   /usr/lib/
      16 kb        0 kb       16 kb   /usr/lib/

Since each of these libraries is used by roughly 20 processes, fixing any one of them multiplies the benefit by about 20.

We should also consider reducing the number of processes on the desktop. For example, clock-applet takes up 2.7 MB of private dirty rss. 1.7 MB of this is the heap and stack; the other 1 MB is the .data sections of .so files. For the most part, these are constant costs we are going to pay with any process, so reducing the number of processes will reduce the problem.

Tuesday, March 21, 2006

Memory Usage with smaps

As most developers should know by now, the memory statistics given on Linux are mostly meaningless. We have vmsize, which counts the total address space used by a process; this is misleading because it includes pages that are not mapped into memory. We also have rss, which measures the pages that are mapped into memory. However, rss multi-counts shared pages: every process gets charged X kb of rss for libgtk, even though the majority of libgtk's pages are shared across processes.

What we really care about is the private rss: the pages our process maps into memory that are used only by our process. We'd also like to know the rss per mapping, so that we can point fingers and find where to fix the bug.

Up until this point, such statistics have been hard to come by. No longer! The 2.6.16 kernel adds support for smaps: per-mapping data, including each mapping's rss usage. This data lives in /proc/$pid/smaps. However, the format of the smaps file is hard to digest, so I've written a quick Perl script that parses it into something more useful. It uses the Linux::Smaps module from CPAN.

An example of the data generated by this script:

VMSIZE:      41132 kb
RSS:         23052 kb total
              9212 kb shared
                 0 kb private clean
             13840 kb private dirty

  vmsize   rss clean   rss dirty   file
12768 kb        0 kb    12616 kb   [heap]
  196 kb        0 kb      196 kb
  120 kb        0 kb       92 kb   /usr/lib/
  132 kb        0 kb       80 kb
   80 kb        0 kb       60 kb   [stack]
   48 kb        0 kb       48 kb   /usr/lib/
   40 kb        0 kb       40 kb   /usr/lib/
   36 kb        0 kb       36 kb   /usr/lib/

  vmsize   rss clean   rss dirty   file
 2848 kb     1596 kb        0 kb   /usr/lib/
 1172 kb      624 kb        0 kb   /lib/tls/i686/cmov/
  488 kb      400 kb        0 kb   /usr/lib/
  900 kb      396 kb        0 kb   /usr/lib/
  524 kb      360 kb        0 kb   /usr/lib/

The vmsize and total rss are the statistics everyone is used to. The rss is then broken down into shared and private pages. The private rss is what is best thought of as the process's memory usage.

Below this, we give the per-mapping statistics for private mappings. Most of these are either heap pages (e.g., malloc'd data) or writable mappings in .so files (from the .data section); this output is especially helpful for diagnosing the latter. After that, we give the same data for shared mappings (most of which are .so files, executable code, etc.).

Call to action: I would love to see the following done:

  • Check out some of the libraries with large writable segments. An extreme example of this is libaudiofile. This library has 92 kb of dirty, private rss (isn't that naughty!). To make matters worse, the library is used by 22 processes on my GNOME 2.14 desktop. This is about 2 MB of memory! Let's figure out why this is happening and make the data const. Also, it might be wise to see if we could reduce the number of programs using this library. We should try to find any other instances of this.
  • Let's get some smaps based data in gnome-system-monitor, and possibly more low level tools as well.
  • We should look at tools like exmap to get per-page rss info. This is useful for finding out things like which pages Evolution uses from libgtk, and why. We can also use this kind of tool to figure out what we can do for low-memory users: we can simulate high levels of swapping by allocating memory in a dummy process, and then see which pages must be loaded from disk to use the desktop, Evolution, etc.
  • It'd be great to set a community goal for 2.16: we will reduce the private rss used by all GNOME processes in setup X by Y MB. We should also take statistics from 2.14 to make sure that there are no memory usage regressions.

In other performance-related news, somebody has finally gotten good statistics on Firefox's memory usage. It looks like Mozilla is leaking pixmaps when browsing with tabs; I think many people would be made much happier if this could be fixed. This is really great data gathering, and I'd like to see more of it.