Zope Care Day

Today has been decreed Zope Care Day. Andy asked me to figure out
why PUT requests would bloat Zope’s memory. We made a bet that if I
fixed the issue before he woke up, he would eat his laptop without
salt!

Going to log whatever I find here so I don’t lose track of all the
findings.

  • The whole samba starts in ZServer/HTTPServer.py,
    zhttp_collector.

    • If the request size (?) is bigger than 524288 bytes (?), it
      uses a TemporaryFile to store the request data. Otherwise, it
      uses a cStringIO.StringIO (see the sketch after this list). Fair
      enough, though I suppose that threshold could be smaller.
  • However, it only uses zhttp_collector if a CONTENT_LENGTH
    header is found! (zhttp_handler.handle_request)

  • That suggests that if the client (in our case davlib.py) doesn’t
    set Content-Length, a cStringIO.StringIO will be blindly
    created.

  • davlib.py (at least our modified version) seems to set
    Content-Length properly (and so does cadaver), so Zope is
    creating a temporary file as expected.

  • The next thing that happens is that the request is passed through
    cgi.FieldStorage, which creates yet another temporary file
    by reading the one zhttp_collector had created. So far,
    nothing has read the whole file into memory, which is cool.

  • Next thing is traversing to a resource and calling its PUT
    method.

    • If the resource doesn’t yet exist:

      • A webdav.NullResource object is created, and its PUT
        method is called.

      • It looks for a Content-Type header on the request. If that’s
        not found, it tries to guess the content type from the
        filename. If that doesn’t work either, it tries a re.search to
        figure out if it’s a binary file. Hmm… that seems like it
        will fail if the uploaded file is big, as it will receive a file
        object here??

      • PUT looks for a PUT_factory method on the parent object,
        and if that’s not found, it uses
        NullResource._default_PUT_factory, which will:

        • Create a ZopePageTemplate for a file ending in .pt
        • Create a DTMLDocument for anything with a content-type of
          text/html, text/xml or text/plain
        • Create an OFS.Image for anything with a content-type of
          image/*
        • Create an OFS.File for anything else.
      • When inside CMF/Plone, PortalFolder implements
        PUT_factory and delegates to content_type_registry. That
        one may be reading the whole file into memory. Note to
        check later.

    • After PUT_factory is called, everything behaves as if the file
      already existed.

    • The next step is delegating to the PUT method of the existing
      resource or the newly-created object.

    • When using OFS.File, Zope seems to behave exceptionally
      well. Here’s what happens:

      • The request body is read in 64K chunks into a linked list of
        Pdata objects.
      • The Pdata objects get a _p_jar immediately, and a
        subtransaction is triggered.
      • As the subtransaction is triggered, a TmpStore object is
        created to hold the transaction data temporarily.
      • The TmpStore creates yet another temporary file.
      • When the real transaction is committed, all the info on the
        TmpStore is copied over to the real storage.
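
For reference, the buffering decision in zhttp_collector boils down to
something like the sketch below. This is paraphrased from memory, not
the actual ZServer code; SPOOL_THRESHOLD and make_request_buffer are
names I made up for illustration.

    from cStringIO import StringIO
    from tempfile import TemporaryFile

    SPOOL_THRESHOLD = 524288  # 512K, the threshold mentioned above

    def make_request_buffer(content_length):
        # Big request bodies get spooled to a temporary file on disk so
        # the whole upload never has to sit in RAM at once; small ones
        # stay in an in-memory buffer.
        if content_length > SPOOL_THRESHOLD:
            return TemporaryFile('w+b')
        return StringIO()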

Conclusion so far: Zope seems to be able to handle large files
correctly out of the box. The problem may lie somewhere inside
CMF/Plone.

Update: Found two places where Zope was reading the whole file in
memory.

  • NullResource.PUT does a REQUEST.get('BODY', ''), which reads the
    file into a string, thus loading the whole thing in memory.
  • Still in NullResource.PUT, after the object is created but
    before it is stitched into the storage, the PUT method for
    the object is called. OFS.File, though, reads the whole file
    into a single Pdata object if a _p_jar is not found.

Here’s a patch for both problems.
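
Roughly, the idea behind both fixes looks like this. It’s only a
sketch, not the actual patch; body_prefix_for_sniffing and
read_into_pdata_chain are names I made up, and PdataNode stands in
for the real OFS.Image.Pdata.

    CHUNK_SIZE = 1 << 16  # 64K, the same chunk size OFS.File uses

    def body_prefix_for_sniffing(body_file, sniff_size=8192):
        # Fix 1: instead of REQUEST.get('BODY', ''), which slurps the
        # whole body into a string, hand the type-guessing code only
        # the first few KB of the body, then rewind the file.
        head = body_file.read(sniff_size)
        body_file.seek(0)
        return head

    class PdataNode:
        # Stand-in for OFS.Image.Pdata: one chunk of data plus a
        # pointer to the next chunk.
        def __init__(self, data):
            self.data = data
            self.next = None

    def read_into_pdata_chain(body_file):
        # Fix 2: always read the body in 64K chunks into a linked
        # list of Pdata-style nodes, the way OFS.File already does
        # when a _p_jar is around, instead of one giant Pdata object.
        head = tail = None
        size = 0
        while 1:
            chunk = body_file.read(CHUNK_SIZE)
            if not chunk:
                break
            node = PdataNode(chunk)
            if tail is None:
                head = node
            else:
                tail.next = node
            tail = node
            size = size + len(chunk)
        return head, size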

Next step is trying the same thing in the context of a CMF/Plone site.


3 thoughts on “Zope Care Day”

  1. Nope
    BODY there is just used to decide the content type, and even then only if it fails to guess the content type from the filename. Next thing on my task list is to fix that. It really shouldn’t try to look at BODY unless really, really needed.

  2. Thanks heaps for this post, especially this quote:
    """
    The request body is read in 64K chunks into a linked list of
    Pdata objects.
    """
    Which confirms my observations and accounts for the inability to pickle any object greater than ~31MB (32000/64 = 500 chunks, and requires 1000 stack recursions to pickle with Pickler())

    Now I can write a sensible testcase, yay :)

    D
