Posts by Infusioned

1) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112102)
Posted 28 May 2012 by Infusioned
Post:
I'm glad you got to the bottom of things. I guess that means that the next card I add will be a 79xx card instead of another 69xx. I can't imagine why AMD thought worse accuracy was acceptable, considering their whole push for compute-oriented video cards and APUs. Then again, maybe that's why things were changed with the 7xxx cards (assuming you had no errors with those)?

Hats off for all the hard work in getting this app developed.
2) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112098)
Posted 24 May 2012 by Infusioned
Post:
A little update:

I PM'd Raistmer on the Seti Beta boards and asked him to read the last bit of this thread. He said he did not notice a higher failure rate with the 69xx series cards during his development of AMD apps.

Also, poking through my MW wu's, I validate just fine against:

CPU:
171830352
171730343
171601831
171601829
171650656
171850223
171838869

Anonymous GPU:
171940837

Other 69xx: (making sure my card isn't defective)
171917181
171954514

NVIDIA OpenCL:
171784516

HD 58xx GPU:
171907299


So, at this point, I am inclined to believe that my card specifically isn't defective, and that the 69xx series cards are producing valid results.

Should I go back to doing Albert or Einstein wu's?
3) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112094)
Posted 17 May 2012 by Infusioned
Post:
I don't want to get too far off topic here, but it happens there is a paper specifically on the validation strategies for the type of simulation that is done at Milkyway@Home, written by the MW scientists: http://www.cs.rpi.edu/~szymansk/papers/dais10.pdf. Just to cure your nervousness :-)

Cheers
HB


Excellent. I will read it in chunks to break up the day as I need breaks from my work. Thanks.


Edit:
OK, I lied; I read it all just now. So it seems that bad results aren't quite so bad, but they still negatively affect things. And, ironically enough, they do have trusted/untrusted host status for users.

I will try to dig more on this because I see I have a lot of inconclusive results for Einstein now. For what it is worth, I know there was an issue with NVIDIA cards silently overflowing and generating bad numbers on the Seti Beta app. However, that still doesn't excuse bad numbers from AMD 6xxx cards if that's the issue.
4) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112091)
Posted 17 May 2012 by Infusioned
Post:
Ah, I understand. You need a way to cut through all the junk, and the volunteers are the garbage filter, which means good-enough detection is OK. Understood.


Also, I checked my Milkyway@Home history to see if I was having validation issues there:

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=429181

and all my work is validated instantly because they are set to a minimum quorum of 1. I don't know if that's due to the fact that I have 44 million credit and that I am being considered a trusted source (if such a thing is even designated by the server), or whether that's just how the project is. I don't remember it being that way (I thought it used to be a quorum of 2).

So now, that makes me nervous. If my results are off, the project isn't comparing them. And, the project is double precision so that means the results need to be accurate.
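
To make the worry concrete, here is a minimal sketch of how a quorum check behaves. This is not the actual BOINC or Milkyway@Home validator; the function names, the single-number "result", and the tolerance are all assumptions made up for illustration.

```python
import math

def results_agree(a, b, rel_tol=1e-9):
    """Hypothetical comparison: two results match if they are close enough."""
    return math.isclose(a, b, rel_tol=rel_tol)

def validate(results, min_quorum, rel_tol=1e-9):
    """Return a canonical result once min_quorum results agree, else None."""
    for candidate in results:
        matches = [r for r in results if results_agree(candidate, r, rel_tol)]
        if len(matches) >= min_quorum:
            return candidate
    return None

slightly_off = 3.1415          # a result that lost precision somewhere
good = math.pi

# Quorum of 1: the single result is accepted without any comparison.
print(validate([slightly_off], min_quorum=1))
# Quorum of 2: the two results disagree, so nothing is validated yet.
print(validate([slightly_off, good], min_quorum=2))
# Quorum of 2 with two matching results: validated.
print(validate([good, good], min_quorum=2))
```

With a quorum of 1 a result always "agrees" with itself, so an inaccurate value passes unless the server checks it some other way.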
5) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112085)
Posted 15 May 2012 by Infusioned
Post:
The Einstein@Home app does not need (and does not use) any double precision arithmetic on the GPU, so this should not be a factor.



I am aware. The point I was trying to make, though, was that how the math is coded matters greatly and does impact precision of the final answer. Let's take for example pi^16 (exaggerated for show) with 3 different approximations for pi.

3
9
27
81
243
729
2187
6561
19683
59049
177147
531441
1594323
4782969
14348907
43046721

3.1
9.61
29.791
92.3521
286.29151
887.503681
2751.261411
8528.910374
26439.62216
81962.8287
254084.769
787662.7838
2441754.63
7569439.352
23465261.99
72742312.17

3.141592654
9.869604401
31.00627668
97.40909103
306.0196848
961.3891936
3020.293228
9488.531016
29809.09933
93648.04748
294204.018
924269.1815
2903677.271
9122171.182
28658145.97
90032220.84


I did these in Excel, with the last set of calculations using Excel's actual PI() function (the displayed values are obviously truncated decimals).
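
The three columns above can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic; `power_table` is a name made up here, and `math.pi` stands in for Excel's PI().

```python
import math

def power_table(base, max_exp=16):
    """Return [base**1, ..., base**max_exp] by repeated multiplication."""
    values = []
    acc = 1.0
    for _ in range(max_exp):
        acc *= base
        values.append(acc)
    return values

approximations = [3.0, 3.1, math.pi]
tables = {approx: power_table(approx) for approx in approximations}
reference = tables[math.pi][-1]  # best available value of pi^16

for approx in approximations:
    final = tables[approx][-1]
    rel_err = abs(final - reference) / reference
    print(f"pi ~ {approx:<12.9f} pi^16 ~ {final:>16.2f}  relative error {rel_err:.1%}")
```

Using 3 for pi is off by about 4.5% at the start, but after sixteen multiplications the result is off by roughly 52%; with 3.1 the final error is still around 19%.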


So, **in general**, the more precision you start with, the better your final answer (depending on a host of other things I forget from my numerical computation class), but you pay for it with computation time. But I'm sure I'm not telling you guys anything new.

Just out of curiosity, was the Einstein app ever run in double precision and compared to results of single precision calculations? I presume it was based on "does not need", but I'd be interested to know the difference.



At the moment the higher validation failure rate for 6900 series cards is just an observation of correlation, no claim of causality :-), as the number of cards on the Albert@Home project is just too small. It could be an indirect effect, e.g. the FFT lib could choose to switch to a different, but less accurate, code path on 6900 cards because of differences in the runtime characteristics. We'll look into it. Any experience wrt this from other projects is welcome.


All my above hot air aside, I could have sworn I remember reading somewhere about the accuracy of OpenCL results and a statement to the effect of "it seems AMD has ditched some precision in favor of speed"; however, I thought that was rectified with newer Catalyst drivers. Maybe send a PM to Raistmer on the Seti@Home Beta boards. I'm more than positive he will know (I think he's the one who originally posted it).
6) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112083)
Posted 14 May 2012 by Infusioned
Post:
Wow. I find that very strange as the 69xx series cards are double precision vs. the single precision of the NVIDIA and single precision AMD (54xx-57xx, 63xx-68xx, 73xx-76xx) cards.

At Milkyway double precision cards are required. I haven't had any validation errors with my 6950.


http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units
7) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112075)
Posted 12 May 2012 by Infusioned
Post:
These wu's show BRPCUDA32 v1.25 throwing errors:

http://albert.phys.uwm.edu/workunit.php?wuid=69412
http://albert.phys.uwm.edu/workunit.php?wuid=70631
http://albert.phys.uwm.edu/workunit.php?wuid=70986
http://albert.phys.uwm.edu/workunit.php?wuid=71008

Most of which are from the same host (GTX 480), with one from this host (GTX285).

<core_client_version>7.0.25</core_client_version>
<![CDATA[
<message>
Cannot create a symbolic link in a registry key that already has subkeys or values. (0x3fc) - exit code 1020 (0x3fc)
</message>
<stderr_txt>
Activated exception handling...
[08:07:13][4260][INFO ] Starting data processing...
[08:07:13][4260][ERROR] Couldn't initialize CUDA driver API (error: 100)!
[08:07:13][4260][ERROR] Demodulation failed (error: 1020)!
08:07:13 (4260): called boinc_finish

</stderr_txt>
]]>




Also, the BRPSSE3 v1.22 client is throwing errors:

http://albert.phys.uwm.edu/workunit.php?wuid=70871
http://albert.phys.uwm.edu/workunit.php?wuid=70837

(from the same host)

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
]]>
8) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112046)
Posted 6 May 2012 by Infusioned
Post:
This wu seems to be wreaking havoc. I completed it ok, but everyone is erroring out. Your client errored too, Bikeman, but I presume that is because your client is 6.12.33?

http://albert.phys.uwm.edu/workunit.php?wuid=69493

...



Seems to be the same types of problems with this wu also:

http://albert.phys.uwm.edu/workunit.php?wuid=69486
9) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112045)
Posted 6 May 2012 by Infusioned
Post:
This wu seems to be wreaking havoc. I completed it ok, but everyone is erroring out. Your client errored too, Bikeman, but I presume that is because your client is 6.12.33?

http://albert.phys.uwm.edu/workunit.php?wuid=69493



So far:

atiOpenCL: (mine)
Completed ok.


atiOpenCL:
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
There was an error while deleting the color transform. (0x7e3) - exit code 2019 (0x7e3)
</message>


BRP3Cuda32:
<core_client_version>6.12.33</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>


atiOpenCL:
<core_client_version>7.0.26</core_client_version>
<![CDATA[
<message>
There was an error while deleting the color transform. (0x7e3) - exit code 2019 (0x7e3)
</message>
<stderr_txt>
10) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112042)
Posted 5 May 2012 by Infusioned
Post:
p2030.20110421.G41.18+00.30.N.b6s0g0.00000_1400_4 using einsteinbinary_BRP4 version 123 (atiOpenCL)

http://img842.imageshack.us/img842/3608/b6s0g00000014004.jpg
11) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112039)
Posted 5 May 2012 by Infusioned
Post:
p2030.20110421.G41.18+00.30.N.b6s0g0.00000_1832_2 using einsteinbinary_BRP4 version 123 (atiOpenCL)


CPU usage is up a little (steady at ~16% [.16*4cores = ~64%]), but so is GPU usage (45%). All in all, everything is looking good.

http://img585.imageshack.us/img585/6087/b6s0g00000018322.jpg
12) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112030)
Posted 4 May 2012 by Infusioned
Post:
Two more errors with:

<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

http://albert.phys.uwm.edu/result.php?resultid=199760
http://albert.phys.uwm.edu/result.php?resultid=199762


I just read through the 7.0.27 change log and there is some stuff about trying to address this error. I installed 7.0.27; I'll see if this helps.
13) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112016)
Posted 3 May 2012 by Infusioned
Post:
p2030.20110421.G41.29-00.40.S.b0s0g0.00000_2504_0 using einsteinbinary_BRP4 version 123 (atiOpenCL)


http://img140.imageshack.us/img140/4502/b0s0g00000025040.jpg
14) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112015)
Posted 2 May 2012 by Infusioned
Post:
p2030.20110421.G41.29-00.40.S.b0s0g0.00000_1928_0 using einsteinbinary_BRP4 version 123 (atiOpenCL)

This one seems to have some weird GPU load spottiness at around the 20% completion mark, but seems to have steadied out at 23% load.

http://img210.imageshack.us/img210/4024/b0s0g00000019280.jpg

Edit:
I take that back, I noticed spottiness again, so I ran the latest 3 versions of GPU-Z side-by-side just to see if there was a bug in one of the versions. There doesn't appear to be as they all report the same load %.

http://img196.imageshack.us/img196/7073/gpuzcomparison.jpg
15) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112014)
Posted 2 May 2012 by Infusioned
Post:
Digging through some of the stderr outputs I notice the atiOpenCl app is doing an awful lot of checkpointing. Curious to see if the cuda app was the same, I looked into one of my wu's:

http://albert.phys.uwm.edu/workunit.php?wuid=68681



My (atiOpenCL) output (abbreviated):

[06:49:19][3424][INFO ] Starting data processing...
[06:49:19][3424][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[06:49:19][3424][INFO ] Using OpenCL device "Cayman" by: Advanced Micro Devices, Inc.
[06:49:19][3424][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[06:49:19][3424][INFO ] Header contents:
------> Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.40
...
[06:50:25][3424][INFO ] Checkpoint committed!
[06:51:30][3424][INFO ] Checkpoint committed!
[06:52:35][3424][INFO ] Checkpoint committed!
[06:53:41][3424][INFO ] Checkpoint committed!
[06:54:46][3424][INFO ] Checkpoint committed!
[06:55:52][3424][INFO ] Checkpoint committed!
[06:56:58][3424][INFO ] Checkpoint committed!
[06:58:03][3424][INFO ] Checkpoint committed!
[06:59:08][3424][INFO ] Checkpoint committed!
[07:00:15][3424][INFO ] Checkpoint committed!
[07:01:20][3424][INFO ] Checkpoint committed!
[07:02:25][3424][INFO ] Checkpoint committed!
[07:03:30][3424][INFO ] Checkpoint committed!
[07:04:36][3424][INFO ] Checkpoint committed!
[07:05:41][3424][INFO ] Checkpoint committed!
[07:06:47][3424][INFO ] Checkpoint committed!
[07:07:53][3424][INFO ] Checkpoint committed!
[07:08:58][3424][INFO ] Checkpoint committed!
[07:09:25][3424][INFO ] OpenCL shutdown complete!
[07:09:25][3424][INFO ] Data processing finished successfully!
...


And then repeats the process for:

Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.50
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.60
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.70
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.80
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.90
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM127.00
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM127.10

Checkpointing each WAPP file once per minute, 20 times.



Comparing to the BRP3cuda32 app (abbreviated):

[12:27:01][5004][INFO ] Starting data processing...
[12:27:01][5004][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 218 MB (807 MB free / 1025 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[12:27:01][5004][INFO ] Using CUDA device #0 "GeForce GTX 560" (336 CUDA cores / 1105.44 GFLOPS)
[12:27:01][5004][INFO ] Version of installed CUDA driver: 4020
[12:27:01][5004][INFO ] Version of CUDA driver API used: 3020
[12:27:01][5004][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[12:27:01][5004][INFO ] Header contents:
------> Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.40
...
[12:27:31][5004][INFO ] Checkpoint committed!
[12:28:01][5004][INFO ] Checkpoint committed!
[12:28:31][5004][INFO ] Checkpoint committed!
[12:29:01][5004][INFO ] Checkpoint committed!
[12:29:31][5004][INFO ] Checkpoint committed!
[12:30:02][5004][INFO ] Checkpoint committed!
[12:30:32][5004][INFO ] Checkpoint committed!
[12:31:01][5004][INFO ] Data processing finished successfully!
...


which then also repeats for:

Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.50
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.60
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.70
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.80
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM126.90
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM127.00
Original WAPP file: ./p2030.20110421.G41.29-00.40.S.b0s0g0.00000_DM127.10

Checkpointing each WAPP file once per minute, 5 times.
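
Rather than counting by eye, the checkpoint lines can be counted and timed with a short script. This is a rough sketch that assumes the stderr format shown in the excerpts above; `checkpoint_intervals` is a hypothetical helper, and it does not handle a run that crosses midnight.

```python
import re
from datetime import datetime

# Matches lines such as "[12:27:31][5004][INFO ] Checkpoint committed!"
CHECKPOINT = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[\d+\]\[INFO \] Checkpoint committed!")

def checkpoint_intervals(stderr_text):
    """Return (checkpoint count, list of seconds between consecutive checkpoints)."""
    times = []
    for line in stderr_text.splitlines():
        match = CHECKPOINT.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return len(times), gaps

# The checkpoint lines from the CUDA excerpt above.
sample = """\
[12:27:31][5004][INFO ] Checkpoint committed!
[12:28:01][5004][INFO ] Checkpoint committed!
[12:28:31][5004][INFO ] Checkpoint committed!
[12:29:01][5004][INFO ] Checkpoint committed!
[12:29:31][5004][INFO ] Checkpoint committed!
[12:30:02][5004][INFO ] Checkpoint committed!
[12:30:32][5004][INFO ] Checkpoint committed!
[12:31:01][5004][INFO ] Data processing finished successfully!
"""

count, gaps = checkpoint_intervals(sample)
print(count, gaps)
```

On the seven CUDA lines quoted above it reports gaps of about 30 seconds. Both excerpts are abbreviated, so the counts in the full stderr may differ from what is shown here.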




So, my questions are:
* What is checkpointing? Is it an intermediate save of the state (variables), so that if calculations get interrupted you don't have to start over?

* Is the atiOpenCL app checkpointing more? Or is it that the two apps are doing the same amount of work (calcs), and it's just that the CUDA app/GTX 560 is doing more work per unit time and therefore only needs to checkpoint 5 vs. my 20 times?

* Is the GTX 560/CUDA app really 4x faster (20/5 = 4) than the HD6950/AtiOpenCl? The 6950 shows 2253 SP GFLOPS vs. the GTX 560 SP GFLOPS of 1088.6.
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

To semi-answer that, GPU Time indicates a 2.503x increase for the GTX560/CUDA vs. the AtiOpenCl/HD6950. The CPU time for the CUDA app is, however, 4.24x less than that of the OpenCl app. Anandtech Bench shows the 2500k to be slightly better than my AMD 975BE in single-threaded, multi-threaded, and total MIPS (7-Zip test), but nothing earth-shattering.
http://www.anandtech.com/bench/Product/288?vs=435
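
For what it's worth, the ratios above can be put side by side. This is a back-of-the-envelope sketch using only the figures quoted in this post; reading the "2.503x" as the OpenCL task taking 2.503 times as much GPU time as the CUDA task is an assumption.

```python
# Figures quoted in the post above.
hd6950_peak_gflops = 2253.0   # single-precision peak, HD 6950
gtx560_peak_gflops = 1088.6   # single-precision peak, GTX 560
opencl_checkpoints = 20
cuda_checkpoints = 5
gpu_time_ratio = 2.503        # assumed: OpenCL GPU time / CUDA GPU time

# How much more peak throughput the HD 6950 has on paper.
peak_ratio = hd6950_peak_gflops / gtx560_peak_gflops

# The naive "4x" estimate from counting checkpoints.
checkpoint_ratio = opencl_checkpoints / cuda_checkpoints

# If a card with peak_ratio times the peak takes gpu_time_ratio times longer,
# the CUDA app is extracting this many times more of its card's peak.
utilisation_gap = peak_ratio * gpu_time_ratio

print(f"peak ratio (6950/560):     {peak_ratio:.2f}x")
print(f"checkpoint ratio:          {checkpoint_ratio:.1f}x")
print(f"implied utilisation gap:   {utilisation_gap:.2f}x")
```

Peak GFLOPS are theoretical numbers, so this only gives a rough feel for how far the OpenCL app is from using the 6950's paper advantage.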

I know you said before that the OpenCl app uses way more CPU than the CUDA app. Perhaps the OpenCl standard is still immature, AMD has crappy drivers, or a mix of both? Regardless, I really commend everyone's efforts. Having done a fair bit of coding myself, I know what a pain this can all be.
16) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112013)
Posted 2 May 2012 by Infusioned
Post:
p2030.20110421.G41.29-00.40.S.b0s0g0.00000_1920_1 using einsteinbinary_BRP4 version 123 (atiOpenCL)


http://img96.imageshack.us/img96/6813/b0s0g00000019201.jpg
17) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112009)
Posted 2 May 2012 by Infusioned
Post:
p2030.20110421.G41.29-00.40.S.b0s0g0.00000_1504_1 using einsteinbinary_BRP4 version 123 (atiOpenCL)


http://img15.imageshack.us/img15/3065/b0s0g00000015041.jpg
18) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112008)
Posted 2 May 2012 by Infusioned
Post:
p2030.20110421.G41.29-00.40.S.b0s0g0.00000_1728_0


For some reason this wu is showing 0% GPU load and 25% CPU load. My initial reaction was that this must be an error, however, you can see the GPU clock was down to 725 from 840.

http://img140.imageshack.us/img140/883/b0s0g00000017280.jpg
19) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112006)
Posted 2 May 2012 by Infusioned
Post:
p2030.20110421.G41.29-00.40.S.b0s0g0.00000_1264_1


http://img809.imageshack.us/img809/154/b0s0g00000012641.jpg
20) Message boards : Problems and Bug Reports : [New release] BRP app v1.23/1.24 (OpenCL) feedback thread (Message 112003)
Posted 2 May 2012 by Infusioned
Post:
p2030.20110421.G41.29-00.40.S.b0s0g0.00000_744_0 using einsteinbinary_BRP4 version 123 (atiOpenCL)


GPU-Z & Task Manager:
http://img7.imageshack.us/img7/7159/p203020110421g41290040s.jpg


This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration