Posts by Richard Haselgrove |
21)
Message boards :
News :
Project server code update
(Message 113212)
Posted 2 Jul 2014 by Richard Haselgrove Post: BTW I slow down my crunchers here since I don't believe quantity is what you're looking for, and now they will produce a stable number of daily WU. I think that's probably a good idea. We're already at the stage where my last 12 consecutive validations have been against one or other of your hosts (5 different machines, I think). And the machines are all pretty similar, to each other and to mine: GTX 670/690/780, running Win7/64 or (in one case) Server 2008. In order to see (now) and test (later) BOINC's behaviour in the real world, we probably need a reasonable variation in hosts to give us realistic variation in the times and credits. Bernd has launched a new 'BRP5' (Perseus Arm Survey) v1.40, with a Beta app tag on it, to test that new feature in the BOINC scheduler. I'm in the process of switching my machine over to run that instead: some company would be nice, but be warned: we're half expecting to fall over the 'EXIT_TIME_LIMIT_EXCEEDED' problem at some stage with BRP5 Beta, so hosts running it probably need to be watched quite closely for strange estimated runtimes, and you need to be ready to take action to correct it. |
22)
Message boards :
Problems and Bug Reports :
Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
(Message 113208)
Posted 2 Jul 2014 by Richard Haselgrove Post: I think we're going to have real problems with the Gamma-ray pulsar search #3 app for a while. I posted that my host 11362 was getting runtime estimates of 12 minutes, time allowed 4 hours, @ 20 GHz. Turns out that two of the three tasks I've returned so far would have exceeded bounds if I hadn't inoculated them. So my GTX 470 GPU is running at an effective rate of 1 GHz or less. As is described elsewhere, this app is very much still a work-in-progress, where very little work is done on the GPU, and most of it still on the CPU - it wants a full CPU core, and uses it to the hilt. Similarly, TJ's GTX 660 has been taking around three hours for the matching tasks over at the main Einstein project. So that makes even more of a mockery of the server dishing out a bounds limit of four minutes for his machine - his speed must be mis-estimated by a factor of 1,000 or so. And to put the icing on the cake, all three of my returned results have been paired with different anonymous Intel HD 2500 GPUs running with the dodgy OpenCL 1.1 driver that Claggy noticed. Inconclusive, the lot of them. It's going to take a while to get the server averages back into kilter... |
23)
Message boards :
News :
Project server code update
(Message 113207)
Posted 2 Jul 2014 by Richard Haselgrove Post: I wonder why Claggy's laptop gets such variable credit? Conversely, when he's paired with me - now back to lower, stable, runtimes - no jackpot, no bonus. Sorry 'bout that. |
24)
Message boards :
Problems and Bug Reports :
Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
(Message 113202)
Posted 1 Jul 2014 by Richard Haselgrove Post: I have the same errors, but my wing(wo)men with nVidia cards also have this error. If done by a CPU then it is validated. So I think it has something to do with the GPU app. Yes, visible now, thanks. I assume we're talking about Error Gamma-ray pulsar search #3 tasks for computer 7731 - tasks issued yesterday. Unfortunately, Application details for host 7731 shows no APR for that app, because none of the tasks completed successfully. And the server log https://albert.phys.uwm.edu/host_sched_logs/7/7731 isn't much use either, because the last scheduler contact was to report work only, with no new work requested. What I'd like to see, if at all possible, is a copy of the server log for an example of a work request where an FGRP task was issued. It would look something like

2014-07-01 17:18:03.1608 [PID=30917] [version] Checking plan class 'FGRPopencl-nvidia'

Note that in my case (from host 11362) the server is estimating - last line - that the task will run for 746 seconds (which is what I'm seeing locally too), and won't be thrown out with a time limit error for over four hours. That's calculated from "using conservative projected flops: 20.12G" a few lines above (which is a new one on me). Since your tasks error out in under 4 minutes, I assume the initial estimates must have been 20 times smaller than that - 12 seconds or something. What I'd ideally like to see is a similar server log from your machine, showing the GFlops value it's using to calculate your runtime. You have to be quick to catch it: there seem to be very few tasks around at the moment, and I had to try several times. Then, you have to capture the server log within a minute, otherwise another attempt will overwrite the successful one (unless you set NNT before your computer asks again). There's something very odd about the way the Albert server is setting these estimated speeds, and we haven't fully got to the bottom of it yet. |
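[Editor's note] The arithmetic behind those two figures can be sketched as follows. This is an illustrative back-of-envelope calculation, not actual BOINC scheduler code; the function and variable names (estimate_runtime, fpops_est, projected flops) are assumptions based on the server log fields quoted above.

```python
# Hedged sketch, not BOINC source: how a projected-flops figure turns into
# a runtime estimate, and why an inflated speed shrinks the time allowance.
GFLOPS = 1e9

def estimate_runtime(rsc_fpops_est, projected_flops):
    # Estimated runtime (seconds) = estimated FP operations / projected speed
    return rsc_fpops_est / projected_flops

# Figures quoted above for host 11362: ~746 s estimate at 20.12 GFLOPS.
fpops_est = 746 * 20.12 * GFLOPS   # total FP ops, back-computed from the log

print(estimate_runtime(fpops_est, 20.12 * GFLOPS))       # ~746 seconds

# If the server overestimated the host's speed 20x, the same task's estimate
# - and the matching rsc_fpops_bound cutoff - would shrink 20x, which is how
# a four-hour allowance collapses into an error in under four minutes.
print(estimate_runtime(fpops_est, 20 * 20.12 * GFLOPS))  # ~37 seconds
```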
25)
Message boards :
Problems and Bug Reports :
Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
(Message 113200)
Posted 1 Jul 2014 by Richard Haselgrove Post: I have the same errors, but my wing(wo)men with nVidia cards also have this error. If done by a CPU then it is validated. So I think it has something to do with the GPU app. @ tjreuter, Could you possibly unhide your host(s) at this project, or give us a direct link to the one you're having problems with? It would help us to give you more specific advice, and it would also help us (and the project) to understand more clearly why this problem happens in the first place. |
26)
Message boards :
News :
Project server code update
(Message 113197)
Posted 1 Jul 2014 by Richard Haselgrove Post: Latest scattergram. I've reverted my 5367 to normal running (early afternoon yesterday), so my timings *should* be lower and steadier - doesn't really seem to show in credit yet. I wonder why Claggy's laptop gets such variable credit? |
27)
Message boards :
Problems and Bug Reports :
Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
(Message 113196)
Posted 1 Jul 2014 by Richard Haselgrove Post: Specifically:
1) Fetch some FGRP (Gamma-ray pulsar search) work.
2) Exit BOINC completely.
3) Edit <rsc_fpops_bound> as Eyrie describes. You'll find it in the <workunit> definition for each of the tasks you've downloaded.
4) Restart BOINC, and allow the tasks to run and report as usual.
Probably best to set 'No New Tasks' while you do this. Once you've reported and validated 11 tasks, the procedure should no longer be necessary. If you didn't get 11 validations from the first batch, repeat as needed. |
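[Editor's note] For anyone with many tasks to treat, the manual edit above can be scripted. The sketch below is a hypothetical helper, not project-supplied tooling: the 100x factor and the sample XML fragment are illustrative assumptions, and BOINC must be fully stopped before client_state.xml is touched.

```python
# Hypothetical helper for the workaround above: multiply every
# <rsc_fpops_bound> value in client_state.xml-style text by a safety factor.
# Run ONLY while BOINC is completely shut down.
import re

def inflate_bounds(xml_text, factor=100):
    """Return xml_text with each <rsc_fpops_bound> value multiplied by factor."""
    def bump(match):
        value = float(match.group(1))
        return "<rsc_fpops_bound>%e</rsc_fpops_bound>" % (value * factor)
    return re.sub(r"<rsc_fpops_bound>([0-9.eE+-]+)</rsc_fpops_bound>",
                  bump, xml_text)

sample = "<workunit><rsc_fpops_bound>1.5e+15</rsc_fpops_bound></workunit>"
print(inflate_bounds(sample))
# <workunit><rsc_fpops_bound>1.500000e+17</rsc_fpops_bound></workunit>
```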
28)
Message boards :
News :
Project server code update
(Message 113189)
Posted 29 Jun 2014 by Richard Haselgrove Post: Um, if you don't mind, I think it might be best to wait a little time. The administrators on this project are based in Europe, and as you know Jason is ahead of our time-zone, in Australia. I think it might be better to wait 12 hours or so, until we have a chance to compare notes by email when the lab opens in the morning. After all, we don't want to use up our entire supply of unattached new hosts in one hit, or else we won't have anything left to test Jason's patches with.... |
29)
Message boards :
News :
Project server code update
(Message 113185)
Posted 29 Jun 2014 by Richard Haselgrove Post: At least one of those must be upside down. Well, we do (crudely) have two separate cases to deal with. 1) initial attach. We have to get rid of that divide-by-almost-zero, or hosts can't run. They get the absurdly low runtime estimate/bound and error when they exceed it. 2) steady state. In my (political) opinion, trying to bring back client-side DCF will be flogging one dead horse too many. We need some sort of server-side control of runtime estimates, so that client scheduling works and user expectations are met. I'm happy to accept that the new version will be different to the one we have now, and look forward to seeing it. OK, I'll get out of your hair, and take my coffee downstairs to grab some more stats. |
30)
Message boards :
News :
Project server code update
(Message 113182)
Posted 29 Jun 2014 by Richard Haselgrove Post: See edit to my last. In my view, if the relevant numbers are all <<1, we should be multiplying by them, not dividing by them. Out of coffee error - going shopping. Back soon. |
31)
Message boards :
News :
Project server code update
(Message 113180)
Posted 29 Jun 2014 by Richard Haselgrove Post: app version pfc is normalised to 0.1 (design flaw), and any real samples would have driven it toward 0.05 or lower . so that text should be 10-20x+ marketing flops, and is NOT the intent, nor remotely correct design. It's Gibberish. The advice given to project administrators in http://boinc.berkeley.edu/trac/wiki/AppPlanSpec is: <gpu_peak_flops_scale>x</gpu_peak_flops_scale> I'm wondering whether they put in 0.1, expecting this to be a multiplier (real flops are lower than peak flops), but end up dividing by 0.1 instead? And from what you say, 'default 1' doesn't match the code either? Edit: the alternative C++ documentation for plan_classes is in http://boinc.berkeley.edu/trac/wiki/PlanClassFunc. There, the example is .21 // estimated GPU efficiency (actual/peak FLOPS) At least one of those must be upside down. |
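[Editor's note] The multiply-versus-divide suspicion is easy to see in numbers. A minimal sketch, assuming the 192 GFLOPS client-reported peak mentioned later in the thread; the variable names are illustrative, not the actual BOINC source:

```python
# Illustrative arithmetic only, not sched_version.cpp: a peak-flops *scale*
# of 0.1 should multiply the theoretical peak down to a realistic speed;
# dividing by it instead inflates the estimate 100x relative to intent.
GFLOPS = 1e9
peak_flops = 192 * GFLOPS      # client-reported peak for the GPU
gpu_peak_flops_scale = 0.1     # "estimated GPU efficiency (actual/peak FLOPS)"

intended  = peak_flops * gpu_peak_flops_scale  # 19.2 GFLOPS - plausible
suspected = peak_flops / gpu_peak_flops_scale  # 1920 GFLOPS - absurdly high

print(intended / GFLOPS, suspected / GFLOPS)
```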
32)
Message boards :
News :
Project server code update
(Message 113176)
Posted 29 Jun 2014 by Richard Haselgrove Post: Now the server side, that 'Best version of app' string comes from sched_version.cpp (scheduler inbuilt functions) and uses the following resources: That's OK, I can do a text search in sched_version.cpp same as you. What would perhaps be most useful would be an expanded table of all those TLA variable names, with your assessment of what David intended them to mean, and of what they actually mean in practice. Looking back at the thread openers, I reported: client 192 GFLOPS peak, based on PFC avg: 2124.60G I can't quickly find the client GFLOPS peak number for Claggy's ATI 'Capeverde' with "based on PFC avg: 34968.78G". I'd like to look for the variable (presumably a struct member) where we might expect GFLOPS peak to be stored, and see what it's multiplied by in those initial stages before 11 completions establish an APR. We might expect 0.1 from the words, but we seem to be using >10 by the numbers. |
33)
Message boards :
News :
Project server code update
(Message 113174)
Posted 29 Jun 2014 by Richard Haselgrove Post: Ah, all right. Yes, concentrating on the current code and moving it forward is certainly the right approach - but it's probably worth just being aware of the steps we moved through to reach this point, because it can influence compatibility problems that could arise in the future. As we've discussed, DCF was deprecated from client v7.0.28, and in the server code from a little earlier. But not everything in the BOINC world moves in lockstep, so we have older and newer servers in use, and we also have older and newer clients in use. Older servers take account of client DCF when scaling runtime estimates prior to allocating work:

[send] active_frac 0.999987 on_frac 0.999802 DCF 0.776980

Newer servers don't:

[send] on_frac 0.999802 active_frac 0.999987 gpu_active_frac 0.999978

Those are both the same machine (the one I've been graphing here), which explains why on_frac and active_frac are identical. But the first line comes from the Einstein server log, and the second line from the Albert server log. So, even my late-alpha version of BOINC (v7.3.19) is maintaining, using and reporting DCF against an 'old server' project which needs it. Good compatibility choice. But the reverse case is not so happy. An older client (I'm talking standard stock clients here, not Jason's specially-tweaked client) will go on using and reporting DCF as before, because it doesn't parse the <dont_use_dcf/> tag. But the newer server code has discarded DCF completely, and doesn't scale its internal runtime estimates when presented with a work request from a client which is still using it. This can - and does - result in servers allocating vastly different volumes of work from what the client expects, because the estimation process doesn't have all the same inputs. Say, for the sake of argument, that an 'old' (pre-v7.0.28) client has got itself into a state with DCF=100, and asks for 1 day of work. 
For the BRP4G tasks we're studying here, we'd all expect the server to allocate maybe 20 tasks, and the client to agree with the server calculation of estimated runtime, slightly over 1 day. But if the client is using DCF, and the server isn't, that can appear as a 100 day work cache when the client does the local calculation. That's a case where server-client compatibility breaks down, and breaks down badly. |
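[Editor's note] That 20-task / 100-day divergence is just the two sides running the same formula with different inputs. A toy model (names assumed; BRP4G runtime taken as roughly 4320 s so that 20 tasks is about one day):

```python
# Toy model of the old-client / new-server mismatch described above.
def server_tasks_for(seconds_requested, est_runtime):
    # New server: no DCF scaling - 86400 s requested / ~4320 s per task
    return seconds_requested // est_runtime

def client_cache_days(num_tasks, est_runtime, dcf):
    # Old client: still multiplies every runtime estimate by its own DCF
    return num_tasks * est_runtime * dcf / 86400

tasks = server_tasks_for(86400, 4320)
print(tasks)                                 # 20 tasks allocated by the server
print(client_cache_days(tasks, 4320, 100))   # 100.0 "days" as seen by the client
```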
34)
Message boards :
News :
Project server code update
(Message 113162)
Posted 29 Jun 2014 by Richard Haselgrove Post: From treblehit's server log https://albert.phys.uwm.edu/host_sched_logs/11/11519 2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way. |
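[Editor's note] The sanity cap proposed above could be as simple as a clamp. The function name, the 0.5 rule-of-thumb fraction, and the ~3.5 TFLOPS peak are all illustrative assumptions, not anything in the BOINC server code:

```python
# Sketch of the suggested sanity cap: never let a PFC-derived projected
# speed exceed some rule-of-thumb fraction of the client-reported peak FLOPS.
GFLOPS = 1e9

def capped_projected_flops(pfc_avg_flops, peak_flops, max_fraction=0.5):
    return min(pfc_avg_flops, peak_flops * max_fraction)

# treblehit's log above: PFC avg 16250.85 GFLOPS, far beyond any real GPU.
capped = capped_projected_flops(16250.85 * GFLOPS, 3500 * GFLOPS)
print(capped / GFLOPS)   # 1750.0 - clamped to half the assumed peak
```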
35)
Message boards :
News :
Project server code update
(Message 113155)
Posted 28 Jun 2014 by Richard Haselgrove Post: With new hosts and a new monitor, let's see how that looks. I've knocked out the old data (and with it, the extreme data points) - but even so, Juan's new machines show very wide scatter. Here's that in figures (5367 is one column for credit and turnround, and split before/after the configuration change for runtime):

User:     Jason    Holmis   Claggy   Juan     Juan     Juan     RH
Host:     11363    2267     9008     10352    10512    10351    5367
GPU:      GTX 780  GTX 660  GT 650M  GTX 690  GTX 690  GTX 780  GTX 670

Credit for BRP4G, GPU:
Maximum   2708.58  2197.18  10952.0  7209.47  6889.8   6652.9   4137.85
Minimum   115.82   88.84    153.90   1667.23  1244.41  1546.02  1355.49
Average   1326.79  1277.87  3631.58  2728.70  2198.10  2463.06  2007.02
Median    1541.35  1411.09  2426.03  2135.67  1948.04  2091.49  1910.19
Std Dev   628.07   690.05   2712.34  1403.91  942.62   969.59   305.80
nSamples  76       102      71       52       43       44       459

Runtime (seconds), 5367 shown before / after:
Maximum   5027.36  5088.99  11295.0  5605.83  8922.7   3182.0   4191.43 / 5099.40
Minimum   3239.20  3294.83  8122.09  3081.97  3854.24  1852.2   4061.45 / 4284.52
Average   3645.57  4549.28  8902.94  4411.88  6305.41  2342.3   4128.08 / 4686.13
Median    3535.46  4769.05  8847.82  3673.33  5127.40  1864.0   4127.35 / 4672.83
Std Dev   344.17   456.55   508.22   998.49   1932.50  615.41   20.40 / 204.66
nSamples                                                        365 / 94

Turnround (days):
Maximum   6.09     3.91     2.75     0.08     0.45     0.22     0.91
Minimum   0.13     0.07     0.13     0.04     0.05     0.02     0.15
Average   1.94     1.46     0.90     0.05     0.09     0.03     0.67
Median    1.46     1.54     0.79     0.04     0.06     0.03     0.69
Std Dev   1.78     1.00     0.65     0.01     0.06     0.03     0.12

All three of Juan's machines are showing a very wide variation in runtime - he'll have to explain that by local observation, I can't pick it up from the website. |
36)
Message boards :
News :
Project server code update
(Message 113154)
Posted 27 Jun 2014 by Richard Haselgrove Post: RH - Please let me know if it would be more helpful to simply switch my 7950 from BRP5 to BRP4 or to "remove project" / "add project" (presumably that would create a new host and therefore start credit calcs fresh). Also, is it easier for you if I only run 1 WU at a time? Remove project / add project doesn't normally change the HostID - BOINC is designed to recycle the numbers, if for example it recognises the IP address and hardware configuration. Doesn't matter if it's one at a time or multiples at a time, but it's probably best if you don't mix task types (whether from this project or across projects). If I do start monitoring your host - thanks for the offer - it would help the other observers if you could tell us a bit about any configuration details which can't be observed from the outside - and GPU utilisation factor is one of those. Don't bust a gut changing things over. I need a bit of a breather, and to set up and get used to a replacement monitor: and Bernd needs to test some more new server code fixes next week, which will give us a new set of apps (designated as 'beta', but in reality the same as the existing ones) with blank application_details records to have a go at. |
37)
Message boards :
News :
Project server code update
(Message 113151)
Posted 27 Jun 2014 by Richard Haselgrove Post: Nice graph Richard, maybe you could consider adding one of my 2x690 hosts, 10512 or 10352, since there are no 690s on the graph and they produce a lot of WU. Yes, I'm planning to refresh the graph with new hosts, and they might be suitable. What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes). The sheer number of tasks pushed through isn't particularly important, but the consistency is. It didn't help that Zombie took the two hosts I'd picked off to another project (he's still running other hosts - they crop up in my wingmate lists from time to time), and Mikey leaving the project because it isn't exporting public stats would rule him out. It's quite time-consuming to switch things over, so bear with me - for the time being at least, old results aren't being deleted here, so there's no rush. |
38)
Message boards :
News :
Project server code update
(Message 113149)
Posted 27 Jun 2014 by Richard Haselgrove Post: Right, so Richard's were sort of converging but are all over the place now. Two more for your viewing pleasure. I've started to take out the older hosts, which are returning very few tasks these days, but they served their purpose. Red is now Juan's 10351 (the one he linked two tasks from) - classic view for a new host. And this is mine, still showing scatter from the new configuration. We'll have to wait a few days before Juan will fit on the same scale (although he validated a couple of my oldies overnight - thank you). I'll keep the configuration stable until Sunday night/Monday morning, but I'll have to flip back then - I have some held tasks with deadlines. |
39)
Message boards :
News :
Project server code update
(Message 113142)
Posted 26 Jun 2014 by Richard Haselgrove Post: OK, the effect of my configuration change continues and is even clearer. I simply changed the nature (but not the number) of the tasks running on the CPU while this BRP4G test was running on the GPU. Here are the runtime stats of the two runs:

           (before)  (after)
Maximum    4191.43   5034.97
Minimum    4061.45   4417.27
Average    4128.11   4707.30
Median     4127.66   4668.20
Std Dev    20.45     181.84
nSamples   339       43

and the corresponding graph. I'm told some new hosts are coming online, so that we can watch and examine the "new host / stable (!) project" scenario in detail. I'll add them to the graphs - probably replacing the old hosts on the log graph, since none of them are returning much data now - as soon as I see successful BRP4G tasks coming back in. |
40)
Message boards :
News :
Project server code update
(Message 113134)
Posted 25 Jun 2014 by Richard Haselgrove Post: Updating both graphs, to show a new effect. This morning, I was asked to change the running configuration on my host 5367, for an unrelated reason. As a result, the maximum runtime for these tasks went up from 4137.85 seconds to 4591.35 - nearly 11%. The first task back after that - before APR had a chance to respond, obviously - is the high outlier at 2474.34 I think that's further evidence of the kind of instability we need to cure. |