
cloud.map() using huge amounts of memory, failing

chris.mulligan
posted this on March 12, 2010 13:24

I'm running cloud.map on an admittedly somewhat sizable dataset. It's over 40k items at about 1KB per item on disk, for a ~40MB CSV file. I pull the rows from a CSV file, run them through a map, then write them out again. I'm using chunksize=100. I've written local variations of this that are entirely lazy, so they never keep the whole dataset in memory.

I have 3 versions right now, a normal single python thread, a version with multiprocessing.Pool.map(), and one using picloud. On my laptop it takes 16 minutes in single thread mode, and 8 minutes with two processes. On a server with 8 cores it takes only 1 minute 30 seconds with 8 procs.

The problem is when I try to run the cloud version. It sucks up all the RAM on the system. When I killed the cloud.map() version after 8 minutes it was using 1.7GB/2GB of real memory on my OSX system - everything it could find to gobble. Responsiveness went to hell quickly, and the first time I did it I had to reboot. The job never even showed up on my picloud web interface.

What was it doing that caused such horrible performance?

 

#this causes huge memory usage and dies
def cloudMain(words, input, output):
    '''
    Reads in the list of companies from wordListFile, tries to find them in inputFile,
    exports results to outputFile.
    '''
    import cloud

    jids = cloud.map(checkForMatches2, izip(input, repeat(words)), chunksize=100)
    output.writerows(cloud.result(jids))

#this works fine on my system
def multiMain(words, input, output):
    '''
    Reads in the list of companies from wordListFile, tries to find them in inputFile,
    exports results to outputFile.
    '''
    from multiprocessing import Pool
    pool = Pool()

    for row in pool.imap(checkForMatches2, izip(input, repeat(words)), 100):
        output.writerow(row)

 

Comments

Aaron Staley
PiCloud, Inc.

Hi Chris,

The basic problem is that cloud.map does not support chunksize; only the iresult function does.

What is therefore happening is your entire iterator (izip) is being read into memory.  I suspect if you call zip(input, repeat(words)) (zip, not izip), you will run into the same problem.

The solution is to pack your data into several map jobs.  You can use the following high-level function to pull this off (it is not an optimal implementation, but it should work for you):

 

import cloud

def chunking_map(map_func, map_args, chunksize = 100):
  done = False
  jids = []
  while not done:
    partial_args = []
    for _ in xrange(chunksize): #append chunksize args
      try:
        partial_args.append(map_args.next())
      except StopIteration:
        done = True
        break
    if partial_args: #skip the empty final chunk when the input length is a multiple of chunksize
      jids.extend(cloud.map(map_func,partial_args)) #call map with only chunksize number of args
  return jids

jids = chunking_map(checkForMatches2, izip(input,repeat(words)), chunksize=100)  #replace your cloud.map call with this

 

The one issue with this (and the reason this is not yet in the client) is that the webview will not show a single map job but instead many map jobs that have 100 sub-jobs each.  That being said, this should work for your use case.
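As a minimal sketch of the same chunking idea, the grouping step can also be written with itertools.islice, which avoids the manual StopIteration handling and never produces an empty chunk. The names iter_chunks and submit_in_chunks, and the submit callable, are hypothetical and used only for illustration; in this thread's setting submit would be something like a function that calls cloud.map(checkForMatches2, chunk). The snippet targets Python 3 and does not need the cloud module.

```python
from itertools import islice


def iter_chunks(iterable, chunksize=100):
    """Yield successive lists of at most `chunksize` items from any iterable.

    Only one chunk is held in memory at a time, so a lazy input stays lazy.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:  # input exhausted; never yield an empty chunk
            return
        yield chunk


def submit_in_chunks(submit, args, chunksize=100):
    """Call `submit(chunk)` once per chunk and collect the returned job ids.

    `submit` is a hypothetical callable standing in for a per-chunk
    cloud.map call; it must return an iterable of job ids.
    """
    jids = []
    for chunk in iter_chunks(args, chunksize):
        jids.extend(submit(chunk))
    return jids
```

For example, list(iter_chunks(range(5), 2)) gives [[0, 1], [2, 3], [4]], so the last chunk is simply shorter than the others.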

 

One final note: You will want to use cloud.iresult(jids, chunksize=100) as opposed to cloud.result() to ensure that you don't put all of your result set in memory.
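To illustrate why an iterator of results matters here, the following sketch writes each row as soon as it is produced instead of building the full result list first. iter_results is a hypothetical stand-in for cloud.iresult (it is not PiCloud's implementation, and its rows are made up from the job id purely for demonstration); write_results is likewise an illustrative name.

```python
import csv
import io


def iter_results(jids):
    """Hypothetical stand-in for cloud.iresult: yield one result row per job id.

    A real client would fetch results from the service in batches; here each
    row is derived from the job id so the example is self-contained.
    """
    for jid in jids:
        yield [jid, "result-%d" % jid]


def write_results(jids, out_file):
    """Stream result rows to `out_file` as CSV without materializing them all."""
    writer = csv.writer(out_file)
    count = 0
    for row in iter_results(jids):
        writer.writerow(row)  # the row is no longer referenced after this call
        count += 1
    return count
```

Calling write_results(range(3), io.StringIO()) writes three CSV rows and returns 3; at no point does a list of all rows exist.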

March 12, 2010 13:57

chris.mulligan

Thanks for such a speedy reply. I tried replacing the multiMain izip with zip. It holds steady at a total of 380MB for the main process, plus about 15MB for each of the subprocesses (the 15MB is the same as with izip). I ran the cloud version as-is on a linux server over here with 32GB of RAM. It got to 5GB, and then printed out a urllib2.HTTPError 500 exception.

 

I've modified the code like so. It was only using 40MB, and tons of jobs showed up on the control panel. However it crashed after a while, before writing any data to my output file. On the dashboard it says some jobs were completed.

def cloudMain(words, input, output):
    '''
    Reads in the list of companies from wordListFile, tries to find them in inputFile,
    exports results to outputFile.
    '''
    import cloud

    def chunking_map(map_func, map_args, chunksize = 100):
        done = False
        jids = []
        while not done:
            partial_args = []
            for _ in xrange(chunksize): #append chunksize args
                try:
                    partial_args.append(map_args.next())
                except StopIteration:
                    done = True
                    break
            jids.extend(cloud.map(map_func,partial_args)) #call map with only chunksize number of args
        return jids

    jids = chunking_map(checkForMatches2, izip(input, repeat(words)), chunksize=100)
    output.writerows(cloud.iresult(jids, chunksize=100))






#after it ran for almost 5 minutes
Traceback (most recent call last):
  File "./closeenough.py", line 190, in <module>
    cloudMain(words, input, output)
  File "./closeenough.py", line 137, in cloudMain
    jids = chunking_map(checkForMatches2, izip(input, repeat(words)), chunksize=100)
  File "./closeenough.py", line 135, in chunking_map
    jids.extend(cloud.map(map_func,partial_args)) #call map with only chunksize number of args
  File "/Library/Python/2.6/site-packages/cloud-1.8.2-py2.6.egg/cloud/cloud.py", line 1037, in map
    jids = self.adapter.jobs_map(params=parameters,func=func,mapargs=argList)
  File "/Library/Python/2.6/site-packages/cloud-1.8.2-py2.6.egg/cloud/transport/adapter.py", line 330, in jobs_map
    logdata = (logprefix, self._report.pid if self._report else 1, logcnt))
  File "/Library/Python/2.6/site-packages/cloud-1.8.2-py2.6.egg/cloud/transport/network.py", line 359, in jobs_map
    resp = self.query(self.map_query, params)
  File "/Library/Python/2.6/site-packages/cloud-1.8.2-py2.6.egg/cloud/transport/network.py", line 282, in query
    return self.rawquery(self.url + url, post_values)
  File "/Library/Python/2.6/site-packages/cloud-1.8.2-py2.6.egg/cloud/transport/network.py", line 262, in rawquery
    raise CloudException(body.strip(), status=status, logger=cloudLog)
cloud.cloud.CloudException: Status 460: Map function missing from request.

March 12, 2010 14:13

Aaron Staley
PiCloud, Inc.

A quick update: We've been working round the clock to optimize the cloud module.  Expect an update in a few days which addresses the excess memory allocation and speeds up cloud calls with large amounts of data.

March 17, 2010 00:33

Aaron Staley
PiCloud, Inc.

Hi Chris,

The latest version of the cloud library (1.9) should fix this issue.  Please let us know how it goes.

This release adds the following features which will help with your issue:

  • Streaming of data arguments into the serializer.  In the previous version, all data was read in first and then serialized; this no longer occurs, which speeds things up and reduces memory consumption.
  • Map streaming: Your previous error likely resulted from too much data going to the server in a single map request.  The client now streams map arguments, so this should no longer be an issue.
  • _fast_serialization.  If you really want to speed things up, set _fast_serialization to 2 to use Python's default serializer on arguments.  Note that this is not default, as it can fail for some types of arguments (see http://docs.picloud.com/client_adv.html#serialization-speedups).
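As a rough illustration of the caveat in the last bullet, and assuming that "Python's default serializer" refers to the standard pickle module (cPickle on Python 2), which the thread does not state explicitly, the sketch below checks whether a given argument can be pickled. can_pickle and the sample values are hypothetical names made up for this example; it does not call the cloud module.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Plain data such as tuples of strings and lists pickles without trouble.
plain_args = ("ACME Corp", ["acme", "acme corporation"])

# A lambda has no importable name, so pickle cannot serialize it.
unnamed_function = lambda row: row.upper()
```

Here can_pickle(plain_args) is True and can_pickle(unnamed_function) is False. Arguments like the CSV rows and word lists discussed in this thread fall into the first group, so the faster setting would plausibly be safe for them, though that is worth checking against the linked documentation.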

Thanks,

Aaron

March 29, 2010 02:55

chris.mulligan

Thanks for the update. I installed 1.9 with easy_install -U. The good news is that it's using almost no RAM, 22m resident. The bad news is that it doesn't seem to be working very well. I tried running with only 2000 records, but after 30 minutes it still hadn't returned results. When I ran with the full dataset (~23k records) I got an HTTP 503 error after 11 minutes.

[Thu Apr 08 08:32:29 2010] - [ERROR] - Cloud.HTTPConnection: HttpConnection.rawquery: Error while opening http connection.
Traceback (most recent call last):
  File "closeenough.py", line 177, in <module>
    cloudMain(words, input, output)
  File "closeenough.py", line 124, in cloudMain
    jids = cloud.map(checkForMatches2, izip(input, repeat(words)), chunksize=1000)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/cloud.py", line 1057, in map
    jids = self.adapter.jobs_map(params=parameters,func=func,mapargs=argList)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/adapter.py", line 336, in jobs_map
    logdata = (logprefix, self._report.pid if self._report else 1, logcnt))
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/network.py", line 393, in jobs_map
    resp = self.query(self.map_query, params) #actual query
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/network.py", line 287, in query
    return self.rawquery(self.url + url, post_values)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/network.py", line 251, in rawquery
    response = urllib2_file.urlopen(request)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/util/urllib2_file.py", line 290, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 395, in open
    response = meth(req, response)
  File "/usr/lib/python2.6/urllib2.py", line 508, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.6/urllib2.py", line 433, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.6/urllib2.py", line 367, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 516, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable.

April 08, 2010 09:18

chris.mulligan

Update: The 2000 record run failed with the same 503 error after 75 minutes.

April 08, 2010 10:11

Aaron Staley
PiCloud, Inc.

Hi Chris,

 

Just wanted to let you know that this problem should be 100% resolved.

 

-Aaron

August 27, 2010 00:08