Hyperz, it was a long time ago

There are a few wrong assumptions out there about open source and how AMD helps open source developers. The claims that have been made are so far from what is happening that i feel i need to state what i believe should be obvious. AMD has helped a lot with implementing the hyperz feature; they provided me with a lot of information, much more than i will probably ever need. So it is not for lack of help from AMD that i am failing to find a solution to my hyperz issue.

I think it all stems from the fact that people believe everything is documented when it comes to GPUs, and that GPU manufacturers have this amazing documentation that tells you exactly what to do. There is obviously more detailed documentation, written by the hardware engineers to describe all the tricks the software engineers can use. There is, however, something that is not always documented: hardware bugs, issues, errata, pick the word you like.

Most of the hardware errata are properly documented, at least i hope so, but sometimes they are not. Either the closed source driver team never ran into the issue, because they do things sufficiently differently from us that they never end up in the same spot we do, or the engineer who figured out the issue simply forgot to file an erratum, for all the kinds of reasons a human can forget to do something. You might object that we could send mail to the engineering team and get an answer. Well, may i remind you that the r6xx family was released in june 2007, and by june 2007 the closed source driver was pretty much done and probably 95% of the hardware errata were already handled in the driver. That means that asking an engineer a question about the r6xx generation is asking about something that is more than 5 years old in their mind. I would not blame any of the engineers for not remembering much about it.
 
So what happened with hyperz is a simple story. I started looking at it probably in january. The first issue i stumbled upon was some kind of checkerboard corruption. There was an erratum with exactly that symptom, and AMD told me about it, but the solution did not help. Thus i started looking at fglrx: i captured one frame of fglrx with hyperz and tried to replay it on the open source driver, but still got the checkerboard issue. Obviously fglrx was setting up the GPU in a different way than we did. This is what triggered the investigation into the backend setup, which produced a patch a couple of months ago that fixed that and also gave a performance improvement.

Once the setup issue was cleared up, i got back to hyperz and stumbled upon more issues than i care to remember; the patch history will probably highlight the biggest ones. Again, all along the way AMD provided me with all the information they had regarding the issues i was facing. But no matter how closely i followed the advice in the AMD documentation, i still ran into issues. I went back to look at what fglrx was doing, and of course i found several things that i believe were documented nowhere, such as never resetting htile preloading if resetting the same surface, or that the first depth clear can't be a fast clear because you need to initialize the htile surface. Maybe i just misread or misunderstood the documentation i was provided, and i apologize if so.

In the end, from a register value point of view, in each use case my patch now pretty much exactly matches the register values fglrx uses. Yet in some specific use cases i am still hitting lockups. So i am left with few options here: either i am missing a single bit somewhere (despite my automatic command stream comparison i might still miss things), or the order in which you do things matters much more than we believe, i.e. you need to program some registers in a specific order to avoid issues. I believe this is the issue i am left with, but trying to match the fglrx order means a huge overhaul of how r600g builds its command stream.
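The two hypotheses (a wrong bit somewhere versus a wrong programming order) can be checked separately. Here is a minimal Python sketch of that kind of comparison. It assumes both command streams have already been decoded into lists of (register, value) writes; the register names and values are placeholders made up for illustration, not real r600 registers or real fglrx output.

```python
def final_register_state(writes):
    """Return the last value written to each register."""
    state = {}
    for reg, value in writes:
        state[reg] = value
    return state


def first_write_order(writes):
    """Return the registers in the order they are first programmed."""
    seen = []
    for reg, _ in writes:
        if reg not in seen:
            seen.append(reg)
    return seen


def compare_streams(ours, reference):
    """Compare two decoded command streams.

    Returns (value_diffs, same_order). value_diffs maps a register to
    (our value, reference value, differing bits) for every register whose
    final value differs; a register missing on one side shows up as None.
    same_order tells whether both streams first touch the registers in
    the same order.
    """
    a = final_register_state(ours)
    b = final_register_state(reference)
    value_diffs = {}
    for reg in sorted(set(a) | set(b)):
        va, vb = a.get(reg), b.get(reg)
        if va != vb:
            xor = None if va is None or vb is None else va ^ vb
            value_diffs[reg] = (va, vb, xor)
    same_order = first_write_order(ours) == first_write_order(reference)
    return value_diffs, same_order


# Hypothetical decoded streams (placeholder names and values).
ours = [("REG_A", 0x3), ("REG_B", 0x10), ("REG_C", 0x0)]
reference = [("REG_B", 0x10), ("REG_A", 0x3), ("REG_C", 0x4)]

diffs, same_order = compare_streams(ours, reference)
print(diffs)       # registers whose final value differs
print(same_order)  # whether the programming order matches
```

A clean value diff with a differing order is exactly the situation described above: the values match, so only the ordering is left as a suspect.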

So the fact is, in the end the closed source driver is the reference implementation that has all the information in it. So looking at the closed source driver command stream is always the safest way to be sure of having all the information. That's at least my opinion.

Weeks in the life of GPU driver developer

Sometimes you feel like you need to cry out loud about the painful process that GPU driver development is. Over the last few weeks i have been working on the DP->VGA bridge (named nutmeg) for AMD A series integrated GPUs. Up until now the VGA output of anything using this nutmeg bridge never lit up. Of course all the documentation we had told us that we were doing things properly, but when the monitor desperately stays black, you have to face the truth: it doesn't work.

So i went on a journey; when working on GPUs, a journey always turns into some kind of heroic quest. But first a bit of background on how you get a picture on the monitor:

framebuffer -> crtc -> encoder -> transmitter -> connector -> monitor

Of course things can go wrong in any or all of those stages (see http://www.botchco.com/agd5f/?p=51 if you want a longer description). Display port adds more fun to the mix. The main idea behind display port is that you have to train the link between the source, your GPU, and the sink, usually your monitor but in this case the nutmeg bridge that converts DP to VGA. Link training is one of those things that never goes quite right. It's as if it was designed to be able to fail in an unlimited number of ways; creativity in failure is a slogan that would fit display port. Don't get me wrong, i prefer display port over DVI or HDMI any time of the day. Display port is almost perfect, but the people designing it must have had a bad day when they came up with link training.

So display port link training between the integrated GPU and the nutmeg VGA bridge was failing reliably for me. As usual i spent a few days trying all sorts of incantations to make it work, bending the display port specification in all possible ways, interpreting it in the most illogical and backward way. I was brave and discarded no possibility.

Because a GPU never surrenders to your will easily, all my attempts at fixing the link training were in vain. So it was time for me to go look at what the good old vesa bios was doing, because the hard truth is that vesa was successful at bringing up VGA. Running the vesa bios in an emulator and catching register reads/writes was something we usually did back in the day, so we already have a tool for that. But nowadays register reads/writes are too verbose: there are too many of them and it's way too tedious to figure out what each register is and why it has such a value.

Before going further i must quickly describe atombios. The poor folks at AMD who work on the video bios decided to come up with a simple language (the opcodes are almost limited to jump, nop, delay, mov, test, add, sub, shift, mask) that could be reused by the driver to perform common tasks that an OEM might need to tweak for its specific design. You see, each OEM has some freedom in what kind of components it pairs the GPU with. For instance different OEMs can choose different DDR chips; each chip has its own timing and needs its own special initialization. For the sake of simplicity, anything that is specific to an OEM should be hidden behind atombios data or code. Why not use x86 directly? Well, nowadays running x86 real mode code can be tricky, and x86 is not the only beast in town. Atombios offers a simple language for which one can easily write an interpreter.
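To give a feel for how small such an interpreter can be, here is a toy Python sketch. It is not atombios: the tuple encoding, the named registers, the flag-based conditional jump ("jump_if") and the omission of delay are all assumptions made for illustration. Only the opcode names are borrowed from the list above.

```python
MASK32 = 0xFFFFFFFF


def run(program, max_steps=1000):
    """Interpret a toy program given as a list of (opcode, *operands) tuples.

    Registers are named 32-bit values; "test" sets a flag that the
    hypothetical "jump_if" opcode consumes.
    """
    regs = {}
    flag = False
    pc = 0
    steps = 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        op, *args = program[pc]
        pc += 1
        if op == "nop":
            pass
        elif op == "mov":
            regs[args[0]] = args[1] & MASK32
        elif op == "add":
            regs[args[0]] = (regs[args[0]] + args[1]) & MASK32
        elif op == "sub":
            regs[args[0]] = (regs[args[0]] - args[1]) & MASK32
        elif op == "shift":  # left shift by a constant
            regs[args[0]] = (regs[args[0]] << args[1]) & MASK32
        elif op == "mask":   # bitwise and with a constant
            regs[args[0]] &= args[1]
        elif op == "test":   # compare a register with a constant
            flag = regs[args[0]] == args[1]
        elif op == "jump":
            pc = args[0]
        elif op == "jump_if":
            if flag:
                pc = args[0]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return regs


# Add 5 to "total" three times, using "count" as a loop counter.
program = [
    ("mov", "count", 3),   # 0
    ("mov", "total", 0),   # 1
    ("add", "total", 5),   # 2
    ("sub", "count", 1),   # 3
    ("test", "count", 0),  # 4
    ("jump_if", 7),        # 5: leave the loop when count reached 0
    ("jump", 2),           # 6
]
result = run(program)
print(result)
```

The whole thing is a fetch, decode, execute loop over a handful of cases, which is why such a language is easy to host on any CPU.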

One of the things that can be, and should be, shared is the modesetting code. This code might need some special tweaks by each OEM depending on the voltage and frequency of the board, and possibly an associated external bridge. So atombios is the perfect place for the modesetting code, and it allows the video bios and the driver (whether closed or open source) to reuse the same code.

So here i am: i don't want to trace register reads/writes of the vesa bios, but i would like to trace the atombios execution of the vesa bios. You see, the x86 vesa bios is almost nothing more than an atombios interpreter that calls and executes various atombios functions. What is interesting to know is the arguments the vesa bios gives to those atombios functions. Because if there is a difference between the open source driver and the vesa bios, it must be in the arguments to an atombios function, or in an atombios function we don't call, or in one we should not call.

So i needed to trace the vesa atombios execution and catch the arguments it gave to the atombios functions. The thing about the video bios is that there are no symbols, it's just pure x86 real mode assembler. Nevertheless there must be an entry point to the atombios interpreter in the vesa bios, and the entry point offset will be the target of some of the x86 call opcodes. So here i am, editing x86emu to trace all the call opcodes and print the register values and stack at each call. You would think that a video bios is small and that there aren't many functions in it, but of course that's not the case. I ended up with several thousand calls, and among those calls ~400 different functions (offsets). That's when you know you need to think a bit and find a way to beat the machine.

I went over all the atombios function arguments and looked for some whose values i could easily guess, for instance the video mode size. One more thing about atombios functions: they are all identified by a unique index. A given function has a fixed index, and you look up its offset in a table (using the index) in the video bios. The atombios interpreter entry point must take as arguments the index of the atombios function and the arguments to that function. Now i knew what kind of numbers i was looking for. After a few (if you ever wanted to meet a euphemism, you now have) false positives i finally spotted the interpreter offset and identified the arguments to the atombios function on the stack.
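The search for the interpreter entry point can be sketched as a filter over the call trace. This Python sketch assumes each traced call was recorded as (target offset, list of stack words); that record format, the offsets and the stack contents are invented for the example and are not real x86emu output.

```python
from collections import defaultdict


def candidate_entry_points(calls, expected_values):
    """Rank call targets by how often all expected values sat on the stack.

    calls is a list of (target_offset, stack_words). Returns a list of
    (target_offset, hit_count), most frequent first.
    """
    hits = defaultdict(int)
    for target, stack in calls:
        if all(value in stack for value in expected_values):
            hits[target] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical trace: we expect a 1024x768 mode to show up as arguments.
calls = [
    (0x1234, [0x0, 1024, 768]),
    (0x2000, [5, 6]),
    (0x1234, [1024, 768, 3]),
    (0x3000, [768, 1024, 0]),  # a false positive: same numbers, other function
]
candidates = candidate_entry_points(calls, [1024, 768])
print([(hex(target), count) for target, count in candidates])
```

The filter only narrows things down; as the false positive in the example shows, each surviving offset still has to be checked by hand.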

It wasn't all downhill from that point. I now had a full trace of vesa setting a mode with atombios. I first replayed the trace to make sure that i had captured things properly and that i got everything, and of course it worked. But this trace was way too big; it had too many calls to too many atombios functions. I had to trim it down, and so i did. This is the usual routine: start from a non-working condition, remove some atombios function calls, and run the trace; if it still brings up the VGA output, the removed calls are not crucial and can be dropped. Well, you get it: refine and repeat over and over until you come up with a minimal trace.
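The trimming loop is a simple greedy reduction. Here is a minimal Python sketch of it, assuming a trace is a list of calls and that a function works(trace) answers whether replaying the trace lights up the output. In real life that check is a replay on the hardware followed by a look at the monitor; the predicate and call names below are stand-ins invented for the example.

```python
def minimize_trace(trace, works):
    """Greedily drop calls one at a time while works(trace) stays true."""
    if not works(trace):
        raise ValueError("the full trace must work before trimming it")
    current = list(trace)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if works(candidate):
            current = candidate  # call i was not needed, retry the same index
        else:
            i += 1               # call i is needed, keep it and move on
    return current


def works(trace):
    """Stand-in for "replay the trace and check the monitor"."""
    return (
        "set_mode" in trace
        and "enable_output" in trace
        and trace.index("set_mode") < trace.index("enable_output")
    )


full_trace = ["init", "set_mode", "delay", "enable_output", "blank"]
minimal = minimize_trace(full_trace, works)
print(minimal)
```

A single greedy pass like this does not guarantee the smallest possible trace, but every call left in the result was needed at the moment it was tested.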

Of course, a GPU is a twisted thing, and nothing goes as planned with it. So i started comparing the minimal trace with the open source driver's atombios execution and parameters and found nothing, or so i thought. You see, most atombios functions take 1 or 2 dwords as arguments; everything is tightly packed, and each field takes only the minimum number of bits needed for it. And there was my downfall: i only paid attention to the first 2 dwords, but the tiny single bit difference between the open source driver and the vesa bios was in the third dword, hidden from my scrutiny.
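Letting a script compare the argument buffers over their full length avoids this kind of oversight. A minimal Python sketch, assuming the arguments of one call are available as a list of 32-bit dwords for each side; the values below are invented, not the real llano arguments.

```python
def dword_bit_diffs(ours, reference):
    """Compare two argument buffers dword by dword, over every dword.

    Returns a list of (dword index, [differing bit positions]).
    """
    if len(ours) != len(reference):
        raise ValueError("argument buffers have different lengths")
    diffs = []
    for index, (a, b) in enumerate(zip(ours, reference)):
        xor = (a ^ b) & 0xFFFFFFFF
        if xor:
            bits = [bit for bit in range(32) if (xor >> bit) & 1]
            diffs.append((index, bits))
    return diffs


# Hypothetical arguments: identical except for one bit in the third dword.
driver_args = [0x00010203, 0x0000FFFF, 0x00000000]
vesa_args = [0x00010203, 0x0000FFFF, 0x00000100]
print(dword_bit_diffs(driver_args, vesa_args))
```

Reporting the bit position directly matters because the fields are tightly packed: a differing bit has to be mapped back to the field it belongs to.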

Of course i can only blame myself for not being thorough enough, but when you spend hours looking at hexadecimal strings, you just can't help being lazy, at least i can't.

This was the tale of how to fix the VGA output of the AMD A series integrated GPU (also known as llano) in 3 weeks. You should not worry about my next journey; it will once again prove tedious and frustrating. I also hope that this little story sheds some light on the difficulties of modesetting. Many people believe that the 3D engine is the most complicated piece of the GPU; well, modesetting is the most unwilling and failure-creative piece of the GPU (in all fairness, sometimes the monitor helps, with broken EDID, narrow-minded working frequencies, limited tolerances, ...). One last thing about modesetting: without it, nothing on the screen ... that could disappoint people who want to use their computer.

Doom3 demo benchmarking made easy

So i decided to take another shot at using doom3 as a benchmark for my GPU work. As i don't have the full game (shame on me) i have to use the demo, but the demo is not useful for recording and playing back anything. But now things are open source, yeah! So with git://git.iodoom.org/iodoom3/iodoom3.git and http://people.freedesktop.org/~glisse/0001-doom3-fix-demo-allow-demo-to-replay-demo-file.patch i was able to build the doom3 demo on linux :) (scons . TARGET_DEMO=1 NOCURL=1).

Then just uncompress doom3-linux-1.1.1286-demo.x86.run and replace gamex86.so with the one from iodoom, gamex86-demo.so. At that point you can launch the demo, start a game, and whenever you think it's a good time, type recorddemo mystuff in the doom3 console to start recording, then stoprecording once you think you have enough frames. To replay your demo: doom3-demo +timedemoquit mystuff.

Note that you need 1.1.1286, as iodoom has a checksum for this file hardcoded (i haven't tried with an older demo, but skipping the checksum check might work).

So now i have something a little more recent and different from the other benchmarks i often use (openarena, nexuiz). I wish there were some free advanced GL demos with nice content we could use...

R600 gallium shader here you are !

So after battling with shaders, thinking my compiler was giving me crap, i noticed that the w component was forced to 0.0 ... well of course, now that the vertex input format is taken into account, things work. So here is tri-flat being rendered using a shiny compiler infrastructure. I also added a todo list in r600_winsys.h (it's big but it's the beginning). So now i will finish plugging in the state handling so tri-flat is actually flat and not gouraud shaded, then do some cleanup in the flush so i only flush when gallium asks for it. Then it's about growing the shader compiler to support more instructions; this should be "easy". I hope to have glxgears soon. Anyway here is a screenshot (i unplugged the clear stuff; now that i got the other part working i will soon plug in the gallium clear helper).

tri-flat

Oh, i forgot to stress that it only works on r7xx because my main computer has a fanless r7xx GPU :) once i get gears working i will make sure that r6xx works too (or just send me a fanless r6xx gpu).

R600 gallium3d grey it's !

So i plugged my r600 winsys code into a skeleton r600 pipe driver and now i have hw accelerated clear! And because i love nice colors, it's clearing in grey and you don't have a choice, it's grey for everybody! So now it's about plugging in a shader compiler so the driver can do something useful, like for instance running glxgears. Bottom line, the command stream scheduler groundwork is mostly done and functional. I am pretty happy with the design: the pipe driver doesn't have to bother with the cs size or bo accounting, and i love that :)

git://anongit.freedesktop.org/~glisse/mesa

R600/R700 Gallium3D winsys emerging

So, you likely noticed i am a crazy blogger with a new post every day ;)
Today's amusement was finishing cleaning up the early state of the r600/r700
winsys API i want to base the r600/r700 gallium3d driver on. There are
still a lot of things to finish, mostly the state split-up.
Once all the states are in a nice structure i will plug this winsys
into a r600/r700 pipe driver, but don't hold your breath: first, i am
doing this during my free time; second, even when we reach this stage
i don't think we will be able to walk on Mars or on the Moon.

What matters here, more than the beauty of the code, is the design of the
beast. It's a different approach from what we have been using so far and
i believe it brings enough interesting features. For instance, the pipe
driver won't have to worry about the cs buffer size. We can do advanced
command/state checking so that we don't try to program the GPU into a
stupid state. There are a bunch of other interesting things about this
but i need to save some of my tricks for my FOSDEM talks.

So if you are bored and you have an R7xx GPU with KMS working then you might
want to test this beautiful software and see a neat grey square on a blue
background, which, oddly enough, i find entertaining. Well, it's maybe
not as much fun as gears but it's good enough.

(It's a standalone KMS app which runs from a console without X running)
http://cgit.freedesktop.org/~glisse/r600winsys/

Road to upstream

I have been working on porting radeon kernel modesetting to the new ttm developed at VMware by Thomas Hellstrom. Just to avoid confusion: ttm is used inside the driver, while the API we expose to userspace is the GEM api. I cleaned up the radeon code along the way and i feel it's now ready for a wider range of testers! A few words on the code itself: there is now a split between the old drm path and the new kernel modesetting path. I did so in order to avoid breaking the old path while also being able to have a clean design and code for the new kernel modesetting path. I am quite happy with how the code looks now. Of course i am not the only one guilty of all this work: Dave Airlie and Alex Deucher did a lot of the original modesetting work, and Dave also worked hard over the last few months on the mesa radeon rewrite, which has its roots in Nicolai Haehnle's initial work. Big kudos to Maciej Cencora (aka Osiris) for all the improvements he made to the code. So here is a screenshot of what you can get with kernel modesetting, a decent ddx and proper mesa (radeon-rewrite branch):

radeon newttm screenshot

So if you feel adventurous here are things you need:

Kernel:
git clone git://people.freedesktop.org/~glisse/drm-next
cd drm-next
git branch drm-next-radeon origin/drm-next-radeon
git checkout drm-next-radeon
Then do the usual kernel configuration; just enable fbcon, ttm and radeon kernel modesetting.

libdrm:
git clone git://anongit.freedesktop.org/git/mesa/drm
cd drm
git branch modesetting-gem origin/modesetting-gem
git checkout modesetting-gem
./autogen.sh --prefix=/usr --libdir=/usr/lib64
(libdir is only needed if on x86-64)
make
sudo make install

DDX (xf86-video-ati):
You also need the Xorg dev packages from your distribution: sudo yum-builddep xorg-x11-drv-ati.x86_64 (on fedora)
git clone git://people.freedesktop.org/~glisse/xf86-video-ati
cd xf86-video-ati
git branch radeon-gem-cs3 origin/radeon-gem-cs3
git checkout radeon-gem-cs3
./autogen.sh --prefix=/usr --libdir=/usr/lib64
(libdir is only needed if on x86-64)
make
sudo make install

Mesa:
git clone git://anongit.freedesktop.org/git/mesa/mesa
cd mesa
git branch radeon-rewrite origin/radeon-rewrite
git checkout radeon-rewrite
./autogen.sh --prefix=/usr --libdir=/usr/lib64 --with-dri-drivers=radeon,r200,r300
(libdir is only needed if on x86-64)
make
sudo make install

Now once you reboot into the new kernel you need to load the radeon module with modeset=1; then your userspace should be all set. I know this needs many things to be built, but hopefully we can soon start merging things upstream little by little, so it should end up in your distro. Anyway, if you own an x200 (igp) or x1250/x1200 with an intel CPU i would love to know if you have any issues with this code. Early testers are always appreciated and i would like to thank spstarr, dilex, osiris, ... for their early testing and reporting on this code.

I forgot to mention that suspend/resume should work flawlessly even if you are in the middle of a quake3 game while spinning the compiz cube (i don't think it's a normal use case but you never know).

Of course performance is not yet what it used to be, but here is my todo list, which should also bring us back somewhere near the old performance and maybe even outperform the old path:
-erase memory on bo allocation (security, needed before upstreaming)
-more checks on the command stream (security, needed before upstreaming)
-port & clean up Dave's page allocator to avoid heavy cache flushing (performance)
-buffer swapping (performance)
-buffer tiling (performance)

radeon DRI2 and compiz: I love it when a plan comes together !

The plan was to quickly hack up DRI2 inside mesa to test the new command stream submission and memory manager. Many things were lacking to get compiz working; add bugs to the equation and i ended up way beyond my schedule :( Anyway, today i am almost there and i got compiz working =) I need to thank Kristian Høgsberg for all the help and advice on getting dri2/compiz working; Dave Airlie also helped me in the painful process of tracking bugs. Sadly there are still a bunch of issues today: as you can see on the screenshot my dri2 is still rendering to the front buffer... i need to sort this out. I also stumbled on a bug in the idr stuff in gem, and need to track this one down too.

The biggest task is to convert the r100 & r200 drivers to the new buffer object and command stream scheme; then we need to do regression testing and in the end, hopefully, we can merge with mesa. That's for userspace, but the tedious bits are the kernel ones. There are many things that need to be cleaned up, and many other pieces that need to be written, like a full command checker (in summary, we need to check the commands userspace wants to execute, so that the gpu won't access unauthorized memory areas or see previous memory content; doing this is not easy and needs a lot of code). Also there are a few aspects of today's memory manager that i don't like, fence and buffer object movement being the top two items i can think of; hopefully the TTM work that is underway will obsolete my concerns. To sum up, it's pretty useless to have userspace if kernel space is not upstream :)
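The core idea of a command checker fits in a few lines, even if a real one needs far more code. Here is a heavily simplified Python sketch, assuming each command has already been reduced to a (buffer handle, offset, size) access and that the kernel knows the size of every buffer owned by the submitting client. Both of these representations are assumptions for illustration; the real checker has to parse hardware packets to get there.

```python
def check_commands(commands, buffers):
    """Reject accesses to unknown buffers or outside a buffer's bounds.

    commands is a list of (handle, offset, size) accesses; buffers maps
    the handles owned by the submitting client to their size in bytes.
    Returns a list of error strings, empty if the stream is acceptable.
    """
    errors = []
    for number, (handle, offset, size) in enumerate(commands):
        if handle not in buffers:
            errors.append(f"command {number}: unknown buffer {handle}")
        elif offset < 0 or size <= 0 or offset + size > buffers[handle]:
            errors.append(f"command {number}: access outside buffer {handle}")
    return errors


# Hypothetical submission: the client owns two buffers.
buffers = {1: 4096, 2: 256}
commands = [
    (1, 0, 4096),   # fine: covers buffer 1 exactly
    (2, 128, 256),  # bad: runs past the end of buffer 2
    (3, 0, 16),     # bad: the client does not own buffer 3
]
errors = check_commands(commands, buffers)
print(errors)
```

The submission is only sent to the hardware when the error list is empty; the "previous memory content" half of the problem is handled separately, by erasing memory on allocation.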



You can see that the gears are at the right position. Note also that you don't see the uncomposited window in the live experience; only the screenshot exhibits this uncomposited content. For curious people, the code is at the same place as before:

git://anongit.freedesktop.org/mesa/drm modesetting-gem
git://people.freedesktop.org/~glisse/xf86-video-ati radeon-gem-cs-dri2
git://people.freedesktop.org/~glisse/mesa r300-bo-cs

Radeon on the way to DRI2

Hey,

It has been a while since my last post :) So i have been spending some time this week on DRI2 and the new command stream submission for radeon (well, to be accurate i have only done r300, but much of the work should apply to other asics as well). This new command stream submission is here to replace the old one, which was basically a wrapper around the hardware format; now we directly use the hardware format, so no more translation, everyone speaks the same language. The new command stream format is also designed to handle what we call relocation: with a memory manager only the kernel side knows the hardware address of a memory object, so the kernel has to change each command that references a memory object and write the hardware address into the command stream before sending the command stream to the hardware. In all this, DRI2 is just a bonus, a low hanging fruit i wanted to taste (i can be greedy sometimes :)). It's not ready at all yet, as it suffers from a massive slowdown (expect to see one frame every minute :o)). Maybe i will have more stuff to blog about in the coming weeks.
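Relocation can be pictured as a patching pass over the command stream. The Python sketch below assumes the stream is a list of 32-bit dwords, that userspace supplies a list of (dword index, buffer handle) relocation entries, and that the dword to patch initially holds an offset inside the buffer. This layout is invented for illustration and is not the actual radeon format.

```python
def apply_relocations(cs, relocs, gpu_addresses):
    """Return a copy of the command stream with hardware addresses patched in.

    cs is a list of dwords, relocs a list of (dword index, buffer handle),
    and gpu_addresses maps handles to the base address only the kernel knows.
    """
    patched = list(cs)
    for index, handle in relocs:
        if handle not in gpu_addresses:
            raise KeyError(f"relocation refers to unknown buffer {handle}")
        # The dword holds an offset inside the buffer; add the base address.
        patched[index] = (gpu_addresses[handle] + cs[index]) & 0xFFFFFFFF
    return patched


# Hypothetical stream: dwords 1 and 3 reference buffers 7 and 9.
cs = [0xC0001000, 0x00000040, 0xC0002000, 0x00000000]
relocs = [(1, 7), (3, 9)]
gpu_addresses = {7: 0x10000000, 9: 0x20000000}

patched = apply_relocations(cs, relocs, gpu_addresses)
print([hex(dword) for dword in patched])
```

Userspace never sees the values in gpu_addresses, which is the point: it only names buffers by handle and the kernel fills in where they really live at submission time.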

radeon DRI2 gears

Oh, for curious people, you need the following (giturl branchname):
git://anongit.freedesktop.org/mesa/drm modesetting-gem
git://people.freedesktop.org/~glisse/xf86-video-ati radeon-gem-cs-dri2
git://people.freedesktop.org/~glisse/mesa r300-bo-cs
None of this is supported; i tend to break things and delete branches once in a while.