[New release] BRP app v1.23/1.24 (OpenCL) feedback thread


Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread

Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112089 - Posted: 16 May 2012, 20:32:34 UTC - in response to Message 112088.  

Hmmm... GPU temperature is ok??

I see one result that IS valid, so it's not like strictly all results are junk.

Out of curiosity, I would underclock the card and see what happens.

Sometimes hardware just fails, e.g. I have one fairly old NVIDIA 9800 GT that tends to produce long runs of invalid results and then returns to normal again. I have a strong suspicion that for that particular card this correlates strongly with (room) temperature. I consider it semi-broken by now and shut it down during summer. So there can be a grey zone between good and broken.

Cheers
HB
ID: 112089
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112090 - Posted: 16 May 2012, 20:45:16 UTC - in response to Message 112085.  
Last modified: 16 May 2012, 20:45:51 UTC


Just out of curiosity, was the Einstein app ever run in double precision and compared to results of single precision calculations? I presume it was based on "does not need", but I'd be interested to know the difference.


If memory serves me right, the BRP (then called ABP) app started with code that indeed used double precision for some parts of its computations, and ran only on CPUs. When the idea came up to implement a GPU version, the code was changed to use single precision in those parts (almost all of the code) that were supposed to go on the GPU. At that point the scientists made sure that the ability to find pulsars wasn't compromised by this change. Note that the task of the app is not to determine the characteristics of a pulsar detection to extremely high precision (this is done in post-processing of pulsar candidates and using re-observations), but to find candidate signals that stick out of the noise sufficiently clearly to follow up on them. While this simplifies things quite a bit, it gives you an intuitive idea of why single precision is ok for this search.
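As a toy illustration of this point (a minimal numpy sketch, not the actual BRP code): inject a weak periodic signal into noise and pick the strongest Fourier bin. The detection decision, i.e. which bin sticks out, comes out the same in single and double precision; only the low bits of the computed power differ.

```python
import numpy as np

# Toy sketch: a weak "pulsar" tone buried in unit-variance noise.
rng = np.random.default_rng(42)
n = 1 << 16
t = np.arange(n)
data = rng.standard_normal(n) + 0.5 * np.sin(2 * np.pi * 1234 * t / n)

peaks = {}
for dtype in (np.float32, np.float64):
    # Power spectrum in the given precision.
    spectrum = np.abs(np.fft.rfft(data.astype(dtype))) ** 2
    # Strongest non-DC bin is our "candidate".
    peaks[np.dtype(dtype).name] = int(np.argmax(spectrum[1:])) + 1

print(peaks)  # both precisions pick out the injected frequency bin
```

The injected tone sits exactly in bin 1234 and towers over the noise floor, so the thresholding decision is insensitive to the precision of the arithmetic.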

Cheers
HB
ID: 112090
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112091 - Posted: 17 May 2012, 0:41:08 UTC - in response to Message 112090.  
Last modified: 17 May 2012, 0:45:41 UTC

Ah, I understand. You need a way to cut through all the junk, and the volunteers are the garbage filter, which means good-enough detection is ok. Understood.


Also, I checked my Milkway@Home history to see if I was having validation issues there:

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=429181

and all my work is validated instantly because they are set to a minimum quorum of 1. I don't know if that's because I have 44 million credit and am being considered a trusted source (if such a thing is even designated by the server), or if that's just how the project is. I don't remember it being that way (I thought it used to be a quorum of 2).

So now, that makes me nervous. If my results are off, the project isn't comparing them. And the project uses double precision, so the results need to be accurate.
ID: 112091
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112092 - Posted: 17 May 2012, 5:58:04 UTC - in response to Message 112091.  
Last modified: 17 May 2012, 5:58:57 UTC

I don't want to get too far off topic here, but it just so happens there is a paper specifically on the validation strategies for the type of simulation that is done at Milkyway@Home, written by the MW scientists: http://www.cs.rpi.edu/~szymansk/papers/dais10.pdf. Just to cure your nervousness :-)

Cheers
HB
ID: 112092
Profile zombie67 [MM]
Joined: 10 Oct 06
Posts: 130
Credit: 30,924,459
RAC: 0
Message 112093 - Posted: 17 May 2012, 15:33:32 UTC - in response to Message 112089.  

Hmmm... GPU temperature is ok??


It is OC'd slightly. I will move back to stock and see if that makes a difference.
Dublin, California
Team: SETI.USA

ID: 112093
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112094 - Posted: 17 May 2012, 15:54:50 UTC - in response to Message 112092.  
Last modified: 17 May 2012, 16:21:38 UTC

I don't want to get too far off topic here, but it just so happens there is a paper specifically on the validation strategies for the type of simulation that is done at Milkyway@Home, written by the MW scientists: http://www.cs.rpi.edu/~szymansk/papers/dais10.pdf. Just to cure your nervousness :-)

Cheers
HB


Excellent. I will read it in chunks to break up the day as I need breaks from my work. Thanks.


Edit:
Ok, I lied, I read it all just now. So it seems that bad results aren't quite so bad, but they still negatively affect things. And, ironically enough, they do have trusted/untrusted host status for users.

I will try to dig more on this because I see I have a lot of inconclusive results for Einstein now. For what it is worth, I know there was an issue with NVIDIA cards silently overflowing and generating bad numbers on the Seti Beta app. However, that still doesn't excuse bad numbers from AMD 6xxx cards if that's the issue.
ID: 112094
Profile zombie67 [MM]
Joined: 10 Oct 06
Posts: 130
Credit: 30,924,459
RAC: 0
Message 112095 - Posted: 17 May 2012, 22:55:15 UTC
Last modified: 17 May 2012, 23:02:10 UTC

Looks like reducing the OC solved it. I also upgraded from 12.3 to 12.4. So I can't be 100% sure. But whatever the case, it's working again.

Also, FWIW, I am running 3 at a time (.33), and still only ~45% GPU load. And this is with cores reserved, so the CPU has only ~90% load. Is it possible to get to >90% GPU load? Is there an upper limit on the number of simultaneous tasks?
Dublin, California
Team: SETI.USA

ID: 112095
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112096 - Posted: 18 May 2012, 17:13:23 UTC - in response to Message 112095.  

Looks like reducing the OC solved it. I also upgraded from 12.3 to 12.4. So I can't be 100% sure. But whatever the case, it's working again.

Also, FWIW, I am running 3 at a time (.33), and still only ~45% GPU load. And this is with cores reserved, so the CPU has only ~90% load. Is it possible to get to >90% GPU load? Is there an upper limit on the number of simultaneous tasks?


The upper limit is reached when the video RAM is exhausted. So per GB of VRAM you should be able to execute at least 2, possibly 3 instances. It's hard to tell where the "sweet spot" is to maximize the overall output, so some experimentation with the number of "reserved" CPU cores (cores not allocated to CPU apps) and the number of GPU jobs running in parallel is the best way to find out.
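For reference, this thread predates the feature, but newer BOINC clients (7.0.40 and later) let you pin the tasks-per-GPU and reserved-CPU settings in an app_config.xml file placed in the project directory. A sketch only: the `<name>` value below ("einsteinbinary_BRP4") is a guess, not confirmed by this thread, and must be replaced with the project's short app name as found in client_state.xml.

```xml
<app_config>
  <app>
    <name>einsteinbinary_BRP4</name>
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

With gpu_usage 0.33 the client schedules three tasks per GPU, matching the .33 setting discussed above; cpu_usage is the fraction of a CPU core budgeted per GPU task. The client has to re-read config files (or be restarted) to pick up changes.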

CU
HB


ID: 112096
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112098 - Posted: 24 May 2012, 1:22:28 UTC - in response to Message 112096.  

A little update:

I PM'd Raistmer on the Seti Beta boards and asked him to read the last bit of this thread. He said he did not notice a higher failure rate with the 69xx series cards during his development of AMD apps.

Also, poking through my MW WUs, I validate just fine against:

CPU:
171830352
171730343
171601831
171601829
171650656
171850223
171838869

Anonymous GPU:
171940837

Other 69xx: (making sure my card isn't defective)
171917181
171954514

NVIDIA OpenCL:
171784516

HD 58xx GPU:
171907299


So, at this point, I am inclined to believe that my card in particular isn't defective, and that the 69xx series cards are producing valid results.

Should I go back to doing Albert or Einstein wu's?
ID: 112098
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112100 - Posted: 26 May 2012, 18:22:40 UTC - in response to Message 112098.  

Hi!


The issue with the HD 6900 series is this: There is a specific function (used by the FFT lib we are using for the OpenCL apps) that is computed with less accuracy on HD 6900 cards than on others. This is confirmed by AMD. It is not even a defect or bug, because the OpenCL standard allows this behavior.

To deal with it, we made an app that uses a more accurate, but somewhat slower variant of this function. On Einstein@Home, this special app version is now delivered to HD6900 cards running the OpenCL app.

Bottom line: it is safe (validation-wise) to resume computations on Einstein@Home with HD6900 cards.
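To get a feel for why a reduced-accuracy math function inside an FFT matters, here is an illustrative numpy sketch. The thread does not name the function involved, so this deliberately substitutes a crude sine approximation (standing in for a low-accuracy hardware routine) when building the DFT twiddle factors, and measures the effect on the transform.

```python
import numpy as np

def dft(x, sin_fn):
    """Naive DFT whose twiddle factors are built from the supplied sine."""
    n = len(x)
    ang = -2.0 * np.pi * np.outer(np.arange(n), np.arange(n)) / n
    # cos(a) = sin(a + pi/2), so both factors come from sin_fn
    w = sin_fn(ang + np.pi / 2) + 1j * sin_fn(ang)
    return w @ x

def rough_sin(a):
    """Crude stand-in for a reduced-accuracy hardware sine."""
    a = np.remainder(a + np.pi, 2 * np.pi) - np.pi  # reduce to [-pi, pi)
    return a - a**3 / 6 + a**5 / 120                # low-order Taylor series

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

exact = dft(x, np.sin)      # matches np.fft.fft(x)
rough = dft(x, rough_sin)   # every output bin is polluted
rel_err = np.max(np.abs(exact - rough)) / np.max(np.abs(exact))
print(f"relative error with low-accuracy twiddle factors: {rel_err:.1e}")
```

Because the twiddle factors enter every butterfly, an accuracy shortfall in one transcendental function spreads across all output bins, which is consistent with the validation failures described earlier in this thread.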

Cheers
HB

ID: 112100
Infusioned

Joined: 11 Feb 05
Posts: 45
Credit: 149,000
RAC: 0
Message 112102 - Posted: 28 May 2012, 1:04:06 UTC - in response to Message 112100.  

I'm glad you got to the bottom of things. I guess that means the next card I add will be a 79xx card instead of another 69xx. I can't imagine why AMD thought worse accuracy was acceptable considering their whole push for compute-oriented video cards and APUs. Then again, maybe that's why things were changed with the 7xxx cards (assuming you had no errors with those)?

Hats off for all the hard work in getting this app developed.
ID: 112102
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112103 - Posted: 1 Jun 2012, 0:06:22 UTC - in response to Message 112102.  
Last modified: 1 Jun 2012, 0:07:06 UTC

I'm glad you got to the bottom of things. I guess that means the next card I add will be a 79xx card instead of another 69xx. I can't imagine why AMD thought worse accuracy was acceptable considering their whole push for compute-oriented video cards and APUs. Then again, maybe that's why things were changed with the 7xxx cards (assuming you had no errors with those)?


It's actually not something you can blame AMD for (and they were quite helpful in diagnosing this issue). The function in question is documented to have implementation-dependent accuracy. It was probably not a good idea for the author of the 3rd party FFT lib to make use of this function, but that's just my personal opinion. We will get rid of this part of the code to make sure this doesn't hit us again with future cards.

Cheers
HB
ID: 112103
robertmiles

Joined: 16 Nov 11
Posts: 19
Credit: 4,468,368
RAC: 0
Message 112104 - Posted: 2 Jun 2012, 3:07:53 UTC - in response to Message 112103.  

When you're able to try it on both HD 69xx cards and similar HD 79xx cards, could you give us the relative speeds of the two?

Some of us would like that information before deciding which card to buy next.
ID: 112104
Profile zombie67 [MM]
Joined: 10 Oct 06
Posts: 130
Credit: 30,924,459
RAC: 0
Message 112105 - Posted: 8 Jun 2012, 13:38:41 UTC
Last modified: 8 Jun 2012, 13:40:36 UTC

* Known issue: no OpenCL support for Mac OS X for the time being (we're still looking into a potential Apple bug)


I could swear that I saw a message yesterday, talking about how this was fixed (hopefully). But I can't find it now, and I cannot get any tasks for my mac. Was I hallucinating?

Edit: It was over at Collatz. D'oh!
Dublin, California
Team: SETI.USA

ID: 112105
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 112106 - Posted: 11 Jun 2012, 14:08:05 UTC - in response to Message 112105.  

* Known issue: no OpenCL support for Mac OS X for the time being (we're still looking into a potential Apple bug)


I could swear that I saw a message yesterday, talking about how this was fixed (hopefully). But I can't find it now, and I cannot get any tasks for my mac. Was I hallucinating?

Edit: It was over at Collatz. D'oh!


Maybe you had sort of a vision, because I've just released, here on Albert, a version that indeed might work on Macs for AMD/OpenCL under OS X (Lion). :-)

Cheers
HBE
ID: 112106




This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration