chris.mulligan
posted this on March 12, 2010 13:24
I'm running cloud.map on an admittedly somewhat sizable dataset. It's over 40k items, at about 1KB per item on disk, for a ~40MB CSV file. I read the rows from a CSV file, run them through a map, then write the results out again. I'm using chunksize=100. I've written local variations of this that are entirely lazy, so they never keep the whole dataset in memory.
I have 3 versions right now: a normal single-threaded Python version, a version with multiprocessing.Pool.map(), and one using picloud. On my laptop it takes 16 minutes in single-thread mode, and 8 minutes with two processes. On a server with 8 cores it takes only 1 minute 30 seconds with 8 procs.
The problem is when I try to run the cloud version. It sucks up all the RAM on the system. When I killed the cloud.map() version after 8 minutes it was using 1.7GB/2GB of real memory on my OSX system - everything it could find to gobble. Responsiveness went to hell quickly, and the first time I did it I had to reboot. The job never even showed up on my picloud web interface.
What was it doing that caused such horrible performance?
#this causes huge memory usage and dies
def cloudMain(words, input, output):
    '''
    Reads in the list of companies from wordListFile, tries to find them in inputFile,
    exports results to outputFile.
    '''
    import cloud
    jids = cloud.map(checkForMatches2, izip(input, repeat(words)), chunksize=100)
    output.writerows(cloud.result(jids))
#this works fine on my system
def multiMain(words, input, output):
    '''
    Reads in the list of companies from wordListFile, tries to find them in inputFile,
    exports results to outputFile.
    '''
    from multiprocessing import Pool
    pool = Pool()
    for row in pool.imap(checkForMatches2, izip(input, repeat(words)), 100):
        output.writerow(row)
Comments
Hi Chris,
The basic problem is that cloud.map does not support chunksize; only the iresult function does.
What is therefore happening is your entire iterator (izip) is being read into memory. I suspect if you call zip(input, repeat(words)) (zip, not izip), you will run into the same problem.
The solution is to pack your data into several map jobs. You can use the following high-level function to pull this off (it is not the most optimal implementation but it should work for you):
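A minimal sketch of such a helper, not necessarily the one originally posted. The names `chunked`, `chunked_map`, and the `submit` parameter are hypothetical; `submit` stands in for cloud.map and is assumed to take a function and a list of arguments and return the job ids for that list.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`.

    Only one chunk is held in memory at a time, so a lazy input
    (a csv reader, an izip, a generator) is never fully materialized.
    """
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def chunked_map(submit, func, iterable, size=100):
    """Submit one map job per chunk and collect all the job ids.

    `submit` is an assumption: any callable shaped like
    submit(func, list_of_args) -> iterable of job ids.
    """
    jids = []
    for chunk in chunked(iterable, size):
        jids.extend(submit(func, chunk))
    return jids
```

With the cloud library this would be called along the lines of chunked_map(cloud.map, checkForMatches2, izip(input, repeat(words))). Only the list of job ids grows with the dataset; the input rows themselves are consumed 100 at a time.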
The one issue with this (and the reason this is not yet in the client) is that the webview will not show a single map job but instead many mapjobs that have 100 sub-jobs each. That being said, this should work for your use case.
One final note: you will want to use cloud.iresult(jids, chunksize=100) as opposed to cloud.result() to ensure that you don't hold your entire result set in memory.
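The difference can be sketched without the cloud library. In the snippet below, `write_results_lazily` is a hypothetical helper and `result_iter` stands in for whatever iterator cloud.iresult(jids) hands back; the assumption is only that it yields one result row at a time.

```python
import csv
import io


def write_results_lazily(result_iter, writer):
    """Write rows one at a time as they come off the iterator.

    Unlike writer.writerows(list_of_all_results), this never needs the
    full result set in memory; each row can be freed once it is written.
    Returns the number of rows written.
    """
    count = 0
    for row in result_iter:
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a lazy result stream; real code would pass the
    # iterator returned by the cloud library instead.
    fake_results = ((name, score) for name, score in [("acme", 1), ("globex", 2)])
    buf = io.StringIO()
    write_results_lazily(fake_results, csv.writer(buf))
```

The same loop shape is what multiMain already uses with pool.imap, which is why that version stays flat in memory.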
Thanks for such a speedy reply. I tried replacing the multiMain izip with zip. It holds steady at a total of 380MB for the main process, plus about 15MB for each of the subprocesses (the 15MB is the same as with izip). I ran the cloud version as-is on a Linux server over here with 32GB of RAM. It got to 5GB, and then printed out a urllib2.HTTPError 500 exception.
I've modified the code like so. It was only using 40MB, and tons of jobs showed up on the control panel. However it crashed after a while, before writing any data to my output file. On the dashboard it says some jobs were completed.
A quick update: We've been working round the clock to optimize the cloud module. Expect an update in a few days which addresses the excess memory allocation and speeds up cloud calls with large amounts of data.
Hi Chris,
The latest version of the cloud library (1.9) should fix this issue. Please let us know how it goes.
This release adds the following features which will help with your issue:
Thanks,
Aaron
Thanks for the update. I installed 1.9 with easy_install -U. The good news is that it's using almost no RAM: 22 MB resident. The bad news is that it doesn't seem to be working very well. I tried running with only 2000 records, but after 30 minutes it still hadn't returned results. When I ran with the full dataset (~23k records) I got an HTTP 503 error after 11 minutes.
<pre>
[Thu Apr 08 08:32:29 2010] - [ERROR] - Cloud.HTTPConnection: HttpConnection.rawquery: Error while opening http connection.
Traceback (most recent call last):
  File "closeenough.py", line 177, in <module>
    cloudMain(words, input, output)
  File "closeenough.py", line 124, in cloudMain
    jids = cloud.map(checkForMatches2, izip(input, repeat(words)), chunksize=1000)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/cloud.py", line 1057, in map
    jids = self.adapter.jobs_map(params=parameters,func=func,mapargs=argList)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/adapter.py", line 336, in jobs_map
    logdata = (logprefix, self._report.pid if self._report else 1, logcnt))
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/network.py", line 393, in jobs_map
    resp = self.query(self.map_query, params) #actual query
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/network.py", line 287, in query
    return self.rawquery(self.url + url, post_values)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/transport/network.py", line 251, in rawquery
    response = urllib2_file.urlopen(request)
  File "/home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.9-py2.6.egg/cloud/util/urllib2_file.py", line 290, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 395, in open
    response = meth(req, response)
  File "/usr/lib/python2.6/urllib2.py", line 508, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.6/urllib2.py", line 433, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.6/urllib2.py", line 367, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 516, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable.
</pre>
Update: The 2000 record run failed with the same 503 error after 75 minutes.
Hi Chris,
Just wanted to let you know that this problem should be 100% resolved.
-Aaron