Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

Previous · 1 · 2

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113232 - Posted: 5 Jul 2014, 21:17:40 UTC - in response to Message 113231.  

I've already reported to Bernd, by email:

And, (4), a quite different gripe. https://albert.phys.uwm.edu/results.php?hostid=11362 is plodding through some FGRP #3 OpenCL tasks. EVERY SINGLE ONE (sorry for shouting) has been paired with a different one from a sequence of apparently identical, anonymous, "Intel(R) Core(TM) i3-3220 CPU" with HD 2500 iGPUs. So far, I've returned results paired with hosts:

9042, 9045, 9046, 9089, 9093, 9095, 9100, 9110, 9128

They were all created between 28 September and 2 October last year, all are still active (have contacted the server within the last 24 hours), and all have the faulty OpenCL v1.1 driver which makes all tasks inconclusive. Somebody is wasting a lot of time and electricity with those machines: they feel like a job lot, and although anonymous, they are probably on the same institutional account. I wondered if you could identify the account holder's email address, and persuade them to update their drivers? It would speed FGRP testing up a lot.

He replied,

I'll take a look; not sure this will fit in today, though.

'today' being Thursday 03 July.
ID: 113232
Claggy
Joined: 29 Dec 06
Posts: 78
Credit: 4,040,969
RAC: 0
Message 113238 - Posted: 6 Jul 2014, 17:57:12 UTC - in response to Message 113232.  

Got my first invalid, where my task was matched against two Intel GPUs running OpenCL 1.1:

Workunit 603716

Claggy
ID: 113238
Jacob Klein
Joined: 6 Nov 11
Posts: 16
Credit: 2,938,967
RAC: 0
Message 113239 - Posted: 6 Jul 2014, 19:39:25 UTC

Let's keep this thread on the topic of its subject, please.
ID: 113239
Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113240 - Posted: 6 Jul 2014, 19:45:03 UTC - in response to Message 113239.  

Let's keep this thread on the topic of its subject, please.

Your last contribution to the subject, five days ago, was:

Just wanted to post to mention that I will likely no longer be testing these problematic work units.

Have you come back to testing, and if so, what have you discovered in the meantime about the cause of the problems?
ID: 113240
Jacob Klein
Joined: 6 Nov 11
Posts: 16
Credit: 2,938,967
RAC: 0
Message 113241 - Posted: 6 Jul 2014, 19:58:14 UTC
Last modified: 6 Jul 2014, 19:59:48 UTC

I have not resumed testing on this issue, and do not anticipate doing so. I replaced my GTS 240 with a second GTX 660 Ti, and am focusing on GPUGrid and Poem.

Regarding the issue, it seemed to be bad estimates (based on the existing GTX 660 Ti, which had a local exclude_gpu option set on this project) for tasks that ran on the uber-weak GTS 240. As I said, rsc_fpops_bound was busted, and the problem was server-side. I don't think it's fixed yet, though I don't know for sure.

Regards,
Jacob
ID: 113241
Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113242 - Posted: 6 Jul 2014, 20:03:35 UTC - in response to Message 113241.  

I have not resumed testing on this issue, and do not anticipate doing so. I replaced my GTS 240 with a second GTX 660 Ti, and am focusing on GPUGrid and Poem.

Regarding the issue, it seemed to be bad estimates (based on the existing GTX 660 Ti, which had a local exclude_gpu option set on this project) for tasks that ran on the uber-weak GTS 240. As I said, rsc_fpops_bound was busted, and the problem was server-side.

Regards,
Jacob

Exactly. Specifically during what we are calling "stage 2 of the onramp": after 100 global validations for the project as a whole, but before 11 local validations for the individual host - the phase during which flops determined by "PFC avg" can be seen in the server logs.

If you're not testing any more, and we understand that much, why do you wish to prevent us from discussing other matters of mutual interest in this thread?
ID: 113242
Jacob Klein
Joined: 6 Nov 11
Posts: 16
Credit: 2,938,967
RAC: 0
Message 113243 - Posted: 6 Jul 2014, 20:45:19 UTC - in response to Message 113242.  

This thread is for the "EXIT_TIME_LIMIT_EXCEEDED" error a user might get running these newer apps. A forum search on that error will find this thread. I am actually still monitoring this thread for an answer.

Any other problem, such as bad OpenCL versions generating bad results and bad validations, deserves its own thread.

Thanks.
ID: 113243
Claggy
Joined: 29 Dec 06
Posts: 78
Credit: 4,040,969
RAC: 0
Message 113244 - Posted: 6 Jul 2014, 20:49:51 UTC - in response to Message 113242.  

Exactly. Specifically during what we are calling "stage 2 of the onramp": after 100 global validations for the project as a whole, but before 11 local validations for the individual host - the phase during which flops determined by "PFC avg" can be seen in the server logs.

And to get to those 100 global validations and 11 local validations, tasks need to validate; having masses of hosts throwing inconclusives into the works is slowing down the process of recovering from the -197 errors, at least for the Gamma-ray pulsar search #3.

Claggy
ID: 113244
Jacob Klein
Joined: 6 Nov 11
Posts: 16
Credit: 2,938,967
RAC: 0
Message 113245 - Posted: 6 Jul 2014, 21:15:41 UTC

I hardly know anything about how the server does its calculations. I had a problem, I reported the problem, and at some point I was hoping to receive an answer to the problem.

In the meantime, I was expecting the thread to stay on topic, to make it easier to find an answer in the future. Maybe I'm old-fashioned.

If you (the Albert team) need me to do additional testing on the "197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED" problem that my computer was receiving, I'd have to swap hardware to do it, but could do so. Let me know if you'd like to request that.

Thanks,
Jacob
ID: 113245
Eyrie
Joined: 20 Feb 14
Posts: 47
Credit: 2,410
RAC: 0
Message 113249 - Posted: 11 Jul 2014, 6:50:42 UTC

I think I found the problem. FGRP has been updated to version 1.12 as a result - the apps are identical.

If anybody runs into further -197 time limit exceeded errors, please report here ASAP. Please always include your host ID - we can glean most variables from database dumps now, but if you can also state your peak_flops (from the BOINC startup messages), that would be very helpful.
Queen of Aliasses, wielder of the SETI rolling pin, Mistress of the red shoes, Guardian of the orange tree, Slayer of very small dragons.
ID: 113249
Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113251 - Posted: 11 Jul 2014, 8:36:32 UTC

Unfortunately, I think Eyrie has jumped the gun on this one.

Speaking specifically about FGRP (Gamma-ray pulsar search #3) only:

My NVidia 420M laptop (host 11359) has just been allocated new work from the v1.12 run. It was sent out with the 'conservative' (first stage onramp) speed estimate of 27.76 GFlops: that's very close to the 23.59 GFlops the same host achieves on BRP4G-cuda32-nv301.

BUT: FGRP is a beta app, which makes very little use of the GPU as yet. It runs much, much slower than BRP4G-cuda32-nv301 on my hardware. The tasks would have got error 197 if I hadn't taken precautions. I can't say whether the problem is Einstein's programming, or NVidia's OpenCL implementation, but at this initial stage for the new app_version, we can't blame BOINC.
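(For the record, the way I understand the client's check: a task gets aborted with -197 once its elapsed time exceeds roughly rsc_fpops_bound divided by the estimated flops. With a made-up bound of 1e15 fpops and the 27.76 GFlops estimate above, that limit works out at about 1e15 / 2.776e10 ≈ 36,000 seconds, or 10 hours; if the app actually delivers only a small fraction of that speed, the task blows through the limit long before it can finish.)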

But we're back to square one with the validation count. Could testers please run more of these tasks (with an edited <rsc_fpops_bound>, so they can complete)? We still need to test how BOINC handles the transitions at 100 validations for the app_version across the project as a whole, and 11 validations for each individual host.
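For anyone unsure where that edit goes: each task's <workunit> block in client_state.xml (in the BOINC data directory) carries the bound. Stop BOINC first, then raise the value - the figures and task name below are purely illustrative, your file will show different ones:

<workunit>
    <name>FGRP3_example_task_name</name>
    <app_name>hsgamma_FGRP3</app_name>
    <rsc_fpops_est>105000000000000.000000</rsc_fpops_est>
    <!-- bound raised to 10x the estimate (an arbitrary safety margin) so the task isn't killed early -->
    <rsc_fpops_bound>1050000000000000.000000</rsc_fpops_bound>
    <!-- other elements left untouched -->
</workunit>

Save the file and restart BOINC. This only helps tasks already on the host; newly downloaded tasks arrive with the server's bound, so repeat as needed.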
ID: 113251
Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113253 - Posted: 11 Jul 2014, 11:46:28 UTC

OK, guys, gals, and fellow alpha-testers. Again, speaking specifically about FGRP (Gamma-ray pulsar search #3) GPU apps only:

I've got a workaround for the -197 EXIT_TIME_LIMIT_EXCEEDED.

First, set up an app_config.xml file (in the Albert@Home project folder under your BOINC data directory) containing this section:

<app_config>
   <app>
      <name>hsgamma_FGRP3</name>            <!-- internal (short) name of the FGRP3 app -->
      <max_concurrent>1</max_concurrent>    <!-- never run more than one FGRP3 task at once -->
      <gpu_versions>
          <gpu_usage>1</gpu_usage>          <!-- count each task as using one whole GPU... -->
          <cpu_usage>1</cpu_usage>          <!-- ...and one whole CPU core -->
      </gpu_versions>
   </app>
</app_config>

Make BOINC read the file, and check that it's been found properly: this is important, else the next stage will try to make your computer run 10 tasks at once....
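(If you're not sure how to make it re-read the file: the Manager's "Read config files" menu item - under the Advanced or Options menu, depending on client version - should pick up a new or changed app_config.xml on a 7.x client, as should running boinccmd --read_cc_config from the command line; failing that, simply restart BOINC.)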

Second, check that you understand the host/venue mapping for your fleet, and identify which host(s) and venue(s) will be running the FGRP3/GPU tests. Go to the Albert@Home preferences page for your account, and for the venue(s) you've selected for the test, set the "GPU utilization factor of FGRP apps" low. Really low. Crazy low, like 0.1. This is madness (don't say you haven't been warned), but with app_config.xml keeping the lid on your machine, it works.

Third, allow and fetch new FGRP3/GPU work for the machine. You should see the estimated run time for any work already cached on the machine jump five-fold - that's your confirmation that the setting has been transferred properly.
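(Presumably five-fold because the previous value was the default 0.5: as far as I can tell, the scheduler's speed estimate for these tasks scales with the utilization factor, so the estimated run time scales with its inverse.)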

That should get us through to the first 100 validations for the project as a whole. Watch out for further advice on how we might handle phase 2.
ID: 113253
Claggy
Joined: 29 Dec 06
Posts: 78
Credit: 4,040,969
RAC: 0
Message 113255 - Posted: 11 Jul 2014, 15:58:21 UTC - in response to Message 113253.  
Last modified: 11 Jul 2014, 16:04:25 UTC

I hope the validate error problem has been fixed. I'm just about to start my first; all my wingmen that have completed this WU already have a validate error:

https://albert.phys.uwm.edu/workunit.php?wuid=603947

Otherwise this is going to be another really long journey.

Claggy
ID: 113255
Claggy
Joined: 29 Dec 06
Posts: 78
Credit: 4,040,969
RAC: 0
Message 113256 - Posted: 11 Jul 2014, 17:39:45 UTC - in response to Message 113255.  
Last modified: 11 Jul 2014, 17:41:30 UTC

That WU is completed, and is showing as 'Completed, waiting for validation', but the 'In progress' wingman is an OpenCL 1.1 Intel GPU, so it's either going to be an inconclusive or a validate error.

A lot of my other WUs were round-1 inconclusives with OpenCL 1.1 Intel GPUs, so they should validate straight away.

Claggy
ID: 113256
Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113257 - Posted: 11 Jul 2014, 22:08:40 UTC

Well, we're making progress with the validations. As of 22:00 UTC (Bernd has kindly given us some access to the statistics), we had the following validations, all from Windows/64:

plan_class              n
FGRPopencl-ati          9
FGRPopencl-intel_gpu    5
FGRPopencl-nvidia      20

I'm still not sure what will happen when that last line reaches 100, but it looks like I won't have to keep watching overnight.
ID: 113257
Trotador
Joined: 15 May 13
Posts: 6
Credit: 26,130,548
RAC: 0
Message 113258 - Posted: 12 Jul 2014, 6:15:46 UTC

I continue to get the "Maximum elapsed time exceeded" error in all Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270) tasks.

The rest of the WUs seem to run OK.
ID: 113258
Trotador
Joined: 15 May 13
Posts: 6
Credit: 26,130,548
RAC: 0
Message 113259 - Posted: 12 Jul 2014, 18:08:29 UTC - in response to Message 113258.  

Good thing I posted - the last four units have finished OK! :)
ID: 113259