[New release] BRP app v1.23/1.24 (OpenCL) feedback thread


Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread

Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112089 - Posted: 16 May 2012, 20:32:34 UTC - in response to Message 112088.  

Hmmm... GPU temperature is ok??

I see one result that IS valid, so it's not like strictly all results are junk.

Out of curiosity, I would underclock the card and see what happens.

Sometimes hardware just fails, e.g. I have one fairly old NVIDIA 9800 GT that tends to produce long runs of invalid results and then returns to normal again. I have a strong suspicion that for that particular card this correlates strongly with (room) temperature. I consider it semi-broken by now and shut it down during summer. So there can be a grey zone between good and broken.

Cheers
HB
ID: 112089
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112090 - Posted: 16 May 2012, 20:45:16 UTC - in response to Message 112085.  
Last modified: 16 May 2012, 20:45:51 UTC


Just out of curiosity, was the Einstein app ever run in double precision and compared to results of single precision calculations? I presume it was based on "does not need", but I'd be interested to know the difference.


If memory serves me right, the BRP (then called ABP) app started with code that indeed used double precision for some parts of its computations, and ran only on CPUs. When the idea came up to implement a GPU version, the code was changed to use single precision in those parts (almost all of the code) that were supposed to go on the GPU. At that point the scientists made sure that the ability to find pulsars wasn't compromised by this change. Note that the task of the app is not to determine the characteristics of a pulsar detection to extremely high precision (this is done in post-processing of pulsar candidates and using re-observations), but to find candidate signals that stick out of the noise sufficiently clearly to follow up on them. While this simplifies things quite a bit, it gives you an intuitive idea of why single precision is ok for this search.
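As a toy illustration of this point (a minimal numpy sketch, not the actual BRP code): inject a weak periodic signal into noise and pick the strongest Fourier bin. The detection decision, i.e. which bin sticks out, comes out the same in single and double precision; only the low bits of the computed power differ.

```python
import numpy as np

# Toy sketch: a weak "pulsar" tone buried in unit-variance noise.
rng = np.random.default_rng(42)
n = 1 << 16
t = np.arange(n)
data = rng.standard_normal(n) + 0.5 * np.sin(2 * np.pi * 1234 * t / n)

peaks = {}
for dtype in (np.float32, np.float64):
    # Power spectrum in the given precision.
    spectrum = np.abs(np.fft.rfft(data.astype(dtype))) ** 2
    # Strongest non-DC bin is our "candidate".
    peaks[np.dtype(dtype).name] = int(np.argmax(spectrum[1:])) + 1

print(peaks)  # both precisions pick out the injected frequency bin
```

The injected tone sits exactly in bin 1234 and towers over the noise floor, so the thresholding decision is insensitive to the precision of the arithmetic.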

Cheers
HB
ID: 112090
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112091 - Posted: 17 May 2012, 0:41:08 UTC - in response to Message 112090.  
Last modified: 17 May 2012, 0:45:41 UTC

Ah, I understand. You need a way to cut through all the junk, and the volunteers are the garbage filter, which means good-enough detection is ok. Understood.


Also, I checked my Milkway@Home history to see if I was having validation issues there:

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=429181

and all my work is validated instantly because they are set to a minimum quorum of 1. I don't know if that's because I have 44 million credit and am being considered a trusted source (if such a thing is even designated by the server), or if that's just how the project is. I don't remember it being that way (I thought it used to be a quorum of 2).

So now, that makes me nervous. If my results are off, the project isn't comparing them. And the project uses double precision, so the results need to be accurate.
ID: 112091
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112092 - Posted: 17 May 2012, 5:58:04 UTC - in response to Message 112091.  
Last modified: 17 May 2012, 5:58:57 UTC

I don't want to get too far off topic here, but it just so happens there is a paper specifically on the validation strategies for the type of simulation that is done at Milkyway@Home, written by the MW scientists: http://www.cs.rpi.edu/~szymansk/papers/dais10.pdf. Just to cure your nervousness :-)

Cheers
HB
ID: 112092
Profile zombie67 [MM]
Joined: 10 Oct 06
Posts: 130
Credit: 30,924,459
RAC: 0
Message 112093 - Posted: 17 May 2012, 15:33:32 UTC - in response to Message 112089.  

Hmmm... GPU temperature is ok??


It is OC'd slightly. I will move back to stock and see if that makes a difference.
Dublin, California
Team: SETI.USA

ID: 112093
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112094 - Posted: 17 May 2012, 15:54:50 UTC - in response to Message 112092.  
Last modified: 17 May 2012, 16:21:38 UTC

I don't want to get too far off topic here, but it just so happens there is a paper specifically on the validation strategies for the type of simulation that is done at Milkyway@Home, written by the MW scientists: http://www.cs.rpi.edu/~szymansk/papers/dais10.pdf. Just to cure your nervousness :-)

Cheers
HB


Excellent. I will read it in chunks to break up the day as I need breaks from my work. Thanks.


Edit:
Ok, I lied, I read it all just now. So it seems that bad results aren't quite so bad, but they still negatively affect things. And, ironically enough, they do have trusted/untrusted host status for users.

I will try to dig more on this because I see I have a lot of inconclusive results for Einstein now. For what it is worth, I know there was an issue with NVIDIA cards silently overflowing and generating bad numbers on the Seti Beta app. However, that still doesn't excuse bad numbers from AMD 6xxx cards if that's the issue.
ID: 112094
Profile zombie67 [MM]
Joined: 10 Oct 06
Posts: 130
Credit: 30,924,459
RAC: 0
Message 112095 - Posted: 17 May 2012, 22:55:15 UTC
Last modified: 17 May 2012, 23:02:10 UTC

Looks like reducing the OC solved it. I also upgraded from 12.3 to 12.4. So I can't be 100% sure. But whatever the case, it's working again.

Also, FWIW, I am running 3 at a time (.33), and still only ~45% GPU load. And this is with cores reserved, so the CPU has only ~90% load. Is it possible to get to >90% GPU load? Is there an upper limit on the number of simultaneous tasks?
Dublin, California
Team: SETI.USA

ID: 112095
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112096 - Posted: 18 May 2012, 17:13:23 UTC - in response to Message 112095.  

Looks like reducing the OC solved it. I also upgraded from 12.3 to 12.4. So I can't be 100% sure. But whatever the case, it's working again.

Also, FWIW, I am running 3 at a time (.33), and still only ~45% GPU load. And this is with cores reserved, so the CPU has only ~90% load. Is it possible to get to >90% GPU load? Is there an upper limit on the number of simultaneous tasks?


The upper limit is reached when the video RAM is exhausted. So per GB of VRAM you should be able to execute at least 2, possibly 3 instances. It's hard to tell where the "sweet spot" is to maximize the overall output, so some experimentation with the number of "reserved" CPU cores (cores not allocated to CPU apps) and the number of GPU jobs running in parallel is the best way to find out.
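For reference, this thread predates the feature, but newer BOINC clients (7.0.40 and later) let you pin the tasks-per-GPU and reserved-CPU settings in an app_config.xml file placed in the project directory. A sketch only: the `<name>` value below ("einsteinbinary_BRP4") is a guess, not confirmed by this thread, and must be replaced with the project's short app name as found in client_state.xml.

```xml
<app_config>
  <app>
    <name>einsteinbinary_BRP4</name>
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

With gpu_usage 0.33 the client schedules three tasks per GPU, matching the .33 setting discussed above; cpu_usage is the fraction of a CPU core budgeted per GPU task. The client has to re-read config files (or be restarted) to pick up changes.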

CU
HB


ID: 112096
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112098 - Posted: 24 May 2012, 1:22:28 UTC - in response to Message 112096.  

A little update:

I PM'd Raistmer on the Seti Beta boards and asked him to read the last bit of this thread. He said he did not notice a higher failure rate with the 69xx series cards during his development of AMD apps.

Also, poking through my MW WUs, I validate just fine against:

CPU:
171830352
171730343
171601831
171601829
171650656
171850223
171838869

Anonymous GPU:
171940837

Other 69xx: (making sure my card isn't defective)
171917181
171954514

NVIDIA OpenCL:
171784516

HD 58xx GPU:
171907299


So, at this point, I am inclined to believe that my card in particular isn't defective, and that the 69xx series cards are producing valid results.

Should I go back to doing Albert or Einstein wu's?
ID: 112098
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112100 - Posted: 26 May 2012, 18:22:40 UTC - in response to Message 112098.  

Hi!


The issue with the HD 6900 series is this: There is a specific function (used by the FFT lib we are using for the OpenCL apps) that is computed with less accuracy on HD 6900 cards than on others. This is confirmed by AMD. It is not even a defect or bug, because the OpenCL standard allows this behavior.

To deal with it, we made an app that uses a more accurate, but somewhat slower variant of this function. On Einstein@Home, this special app version is now delivered to HD6900 cards running the OpenCL app.

Bottom line: it is safe (validation-wise) to resume computations on Einstein@Home with HD6900 cards.
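To get a feel for why a reduced-accuracy math function inside an FFT matters, here is an illustrative numpy sketch. The thread does not name the function involved, so this deliberately substitutes a crude sine approximation (standing in for a low-accuracy hardware routine) when building the DFT twiddle factors, and measures the effect on the transform.

```python
import numpy as np

def dft(x, sin_fn):
    """Naive DFT whose twiddle factors are built from the supplied sine."""
    n = len(x)
    ang = -2.0 * np.pi * np.outer(np.arange(n), np.arange(n)) / n
    # cos(a) = sin(a + pi/2), so both factors come from sin_fn
    w = sin_fn(ang + np.pi / 2) + 1j * sin_fn(ang)
    return w @ x

def rough_sin(a):
    """Crude stand-in for a reduced-accuracy hardware sine."""
    a = np.remainder(a + np.pi, 2 * np.pi) - np.pi  # reduce to [-pi, pi)
    return a - a**3 / 6 + a**5 / 120                # low-order Taylor series

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

exact = dft(x, np.sin)      # matches np.fft.fft(x)
rough = dft(x, rough_sin)   # every output bin is polluted
rel_err = np.max(np.abs(exact - rough)) / np.max(np.abs(exact))
print(f"relative error with low-accuracy twiddle factors: {rel_err:.1e}")
```

Because the twiddle factors enter every butterfly, an accuracy shortfall in one transcendental function spreads across all output bins, which is consistent with the validation failures described earlier in this thread.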

Cheers
HB

ID: 112100
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112102 - Posted: 28 May 2012, 1:04:06 UTC - in response to Message 112100.  

I'm glad you got to the bottom of things. I guess that means the next card I add will be a 79xx card instead of another 69xx. I can't imagine why AMD thought worse accuracy was acceptable considering their whole push for compute-oriented video cards and APUs. Then again, maybe that's why things were changed with the 7xxx cards (assuming you had no errors with those)?

Hats off for all the hard work in getting this app developed.
ID: 112102
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112103 - Posted: 1 Jun 2012, 0:06:22 UTC - in response to Message 112102.  
Last modified: 1 Jun 2012, 0:07:06 UTC

I'm glad you got to the bottom of things. I guess that means the next card I add will be a 79xx card instead of another 69xx. I can't imagine why AMD thought worse accuracy was acceptable considering their whole push for compute-oriented video cards and APUs. Then again, maybe that's why things were changed with the 7xxx cards (assuming you had no errors with those)?


It's actually not something you can blame AMD for (and they were quite helpful in diagnosing this issue). The function in question is documented to have implementation-dependent accuracy. It was probably not a good idea for the author of the 3rd party FFT lib to make use of this function, but that's just my personal opinion. We will get rid of this part of the code to make sure this doesn't hit us again with future cards.

Cheers
HB
ID: 112103
robertmiles

Joined: 16 Nov 11
Posts: 19
Credit: 4,468,368
RAC: 0
Message 112104 - Posted: 2 Jun 2012, 3:07:53 UTC - in response to Message 112103.  

When you're able to try it on both HD 69xx cards and similar HD 79xx cards, could you give us the relative speeds of the two?

Some of us would like that information before deciding which card to buy next.
ID: 112104
Profile zombie67 [MM]
Joined: 10 Oct 06
Posts: 130
Credit: 30,924,459
RAC: 0
Message 112105 - Posted: 8 Jun 2012, 13:38:41 UTC
Last modified: 8 Jun 2012, 13:40:36 UTC

* Known issue: no OpenCL support for Mac OS X for the time being (we're still looking into a potential Apple bug)


I could swear that I saw a message yesterday, talking about how this was fixed (hopefully). But I can't find it now, and I cannot get any tasks for my mac. Was I hallucinating?

Edit: It was over at Collatz. D'oh!
Dublin, California
Team: SETI.USA

ID: 112105
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112106 - Posted: 11 Jun 2012, 14:08:05 UTC - in response to Message 112105.  

* Known issue: no OpenCL support for Mac OS X for the time being (we're still looking into a potential Apple bug)


I could swear that I saw a message yesterday, talking about how this was fixed (hopefully). But I can't find it now, and I cannot get any tasks for my mac. Was I hallucinating?

Edit: It was over at Collatz. D'oh!


Maybe you had sort of a vision, because I've just released, here on Albert, a version that indeed might work on Macs for AMD/OpenCL under OS X (Lion). :-)

Cheers
HBE
ID: 112106




This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration