Visual C++ Team Blog

Visual Studio Express 2012 for Windows Desktop


As you may have seen, Soma announced today that Visual Studio Express 2012 for Windows Desktop is now available for download.  For C++ developers, Express for Windows Desktop includes many of the new C++ investments we made in Visual Studio 2012, including C++ AMP, improved C++11 standards conformance, and improvements to the compiler, linker, and IDE.  It also includes the 64-bit cross-compiler and 64-bit C++ libraries, the C++ unit testing support introduced in Visual Studio 2012, and a targeted set of code analysis rules to help you build more reliable applications.  And while we’re still working on the update that will enable targeting Windows XP, that update will be supported with Express for Windows Desktop when it’s available.  Head on over to this post on the Visual Studio Team blog to join the discussion about what’s new in Express for desktop development.

Jennifer Leaf

Senior Program Manager

Visual C++


C++/CX Part 2 of [n]: Types That Wear Hats


See C++/CX Part 0 of [n]: An Introduction for an introduction to this series.

The hat (^) is one of the most prominent features of C++/CX--it's hard not to notice it when one first sees C++/CX code. So, what exactly is a ^ type? A hat type is a smart pointer type that (1) automatically manages the lifetime of a Windows Runtime object and (2) provides automatic type conversion capabilities to simplify use of Windows Runtime objects.

We'll start off by discussing how Windows Runtime objects are used via WRL, then explain how the C++/CX hat works to make things simpler. For demonstration purposes, we'll use the following modified version of the Number class that we introduced in Part 1:

    public interface struct IGetValue
    {
        int GetValue() = 0;
    };
 
    public interface struct ISetValue
    {
        void SetValue(int value) = 0;
    };
 
    public ref class Number sealed : public IGetValue, public ISetValue
    {
    public:
        Number() : _value(0) { }
 
        virtual int  GetValue()          { return _value;  }
        virtual void SetValue(int value) { _value = value; }
 
    private:
        int _value;
    };

In this modified Number implementation, we define a pair of interfaces, IGetValue and ISetValue, that declare the two member functions of Number; Number then implements these two interfaces. Otherwise, things should look pretty familiar.

Note that Number actually implements three Windows Runtime interfaces: in addition to IGetValue and ISetValue, the compiler still generates the __INumberPublicNonVirtuals interface, which Number implements. Because all of the members of Number are declared by explicitly implemented interfaces (IGetValue and ISetValue), the compiler-generated __INumberPublicNonVirtuals does not declare any members. This interface is still required, though, as it is the default interface for the Number type. Every runtime type must have one default interface, and the default interface should almost always be unique to the class. We'll see why the default interface is important a bit later.

Lifetime Management

Windows Runtime reference types use reference counting for object lifetime management. All Windows Runtime interfaces (including all three of the interfaces implemented by Number) derive directly from the IInspectable interface, which itself derives from the COM IUnknown interface. IUnknown declares three member functions that are used to control the lifetime of an object and to allow type conversion.

The MSDN article "Rules for Managing Reference Counts" has a thorough overview of how IUnknown lifetime management works. The principles are quite straightforward, though: whenever you create a new reference to an object, you must call IUnknown::AddRef to increment its reference count; whenever you "destroy" a reference to an object, you must call IUnknown::Release to decrement the reference count. The reference count is initialized to zero, and after a series of calls to AddRef and Release, when the reference count reaches zero again, the object destroys itself.

Of course, when programming in C++, we should rarely--practically never--call AddRef and Release directly. Instead, we should prefer wherever possible to use a smart pointer that automatically makes these calls when they are required. Use of a smart pointer helps to ensure that objects are neither leaked due to a missed Release nor prematurely destroyed due to a premature Release or failure to AddRef.

ATL includes CComPtr and a family of related smart pointers that have long been used in COM programming for automatically managing the reference counting of objects that implement IUnknown. WRL includes ComPtr, which is an improved and modernized CComPtr (an example improvement: ComPtr does not overload the unary & like CComPtr does).

For those who have not done much COM programming and are unfamiliar with ComPtrs: if you've used shared_ptr (included as part of C++11, C++ TR1, and Boost), ComPtr has effectively the same behavior with respect to lifetime management. The mechanism is different (ComPtr makes use of the internal reference counting provided by IUnknown, while shared_ptr supports arbitrary types and must thus use an external reference count), but the lifetime management behavior is the same.
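To make the lifetime behavior concrete, here is a minimal sketch of ComPtr usage at the ABI level (using the IGetValue interface from above; error handling omitted, as elsewhere in this article):

    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;
 
    void UseNumber(IGetValue* raw)
    {
        // Constructing the ComPtr from a raw pointer calls AddRef.
        ComPtr<IGetValue> holder(raw);
 
        int value = 0;
        holder->GetValue(&value);
 
        // When 'holder' goes out of scope, its destructor calls Release.
    }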

The C++/CX hat has exactly the same lifetime management semantics as a ComPtr. When a T^ is copied, AddRef is called to increment the reference count, and when a T^ goes out of scope or is reassigned, Release is called to decrement the reference count. We can consider a simple example to demonstrate the reference counting behavior:

    {
        T^ t0 = ref new A();
        T^ t1 = ref new B();
 
        t0 = t1;
        t0 = nullptr;
    }

First, we create an A object and give ownership of it to t0. The reference count of this A object is 1 because there is one T^ that has a reference to it. We then create a B object and give ownership of it to t1. The reference count of this B object is also 1.

The end result of t0 = t1 is that both t0 and t1 point to the same object. This must be done in three steps. First, t1->AddRef() is called to increment the reference count of the B object, because t0 is gaining ownership of that object; this causes the reference count of the B object to increase to 2. Second, t0->Release() is called to release t0's ownership of the A object; this causes the reference count of the A object to drop to zero, and the A object destroys itself. Third and finally, t0 is set to point to the B object.

We then assign t0 = nullptr. This "resets" t0 to be null, which causes it to release its ownership of the B object. This calls t0->Release(), causing the reference count of the B object to decrease to 1.

Finally, execution will reach the closing brace of the block: }. At this point, all of the local variables are destroyed, in reverse order. First, t1 is destroyed (the smart pointer, not the object pointed to). This calls t1->Release(), causing the reference count of the B object to drop to zero, so the B object destroys itself. t0 is then destroyed, which is a no-op because it is null.

If lifetime management were our only concern, there wouldn't really be a need for the ^ at all: ComPtr<T> is sufficient to manage object lifetime.

Type Conversion

In C++, some type conversions involving class types are implicit; others may be performed using a cast or a series of casts. For example, if Number and the interfaces it implements were ordinary C++ types and not Windows Runtime types, conversion from Number* to IGetValue* would be implicit, and we could convert from IGetValue* to Number* using a static_cast or a dynamic_cast.

These conversions do not work with Windows Runtime reference types because the implementation of a reference type is opaque and the layout of a reference type in memory is left unspecified. A reference type implemented in C# may be laid out in memory differently from an equivalent reference type implemented in C++. Thus, we cannot rely on C++ language specific features like implicit derived-to-base conversions and casts when working directly with Windows Runtime types.

To perform these conversions, we must instead use the third member function of the IUnknown interface: IUnknown::QueryInterface. This member function can be considered as a language-neutral dynamic_cast: it attempts to perform a conversion to the specified interface and returns whether the conversion succeeded. Because each runtime type implements the IUnknown interface and provides its own definition for QueryInterface, it can do whatever is required to get the correct interface pointer in the language and framework with which it is implemented.
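For illustration, a raw QueryInterface call looks roughly like the following sketch (it assumes a MIDL-generated header that associates ISetValue with its IID via __declspec(uuid), so that __uuidof works; error handling is abbreviated):

    void SetViaQueryInterface(IUnknown* unknown)
    {
        ISetValue* setValue = nullptr;
        HRESULT hr = unknown->QueryInterface(__uuidof(ISetValue),
                                             reinterpret_cast<void**>(&setValue));
        if (SUCCEEDED(hr))
        {
            setValue->SetValue(42);
            setValue->Release(); // QueryInterface called AddRef on success
        }
    }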

Using an object via WRL

Let's take a look at how we'd use our Number class via WRL. This example accepts an instance of Number and calls SetValue to set the value and GetValue to get the value back. (Error checking has been omitted for brevity.)

    void F(ComPtr<__INumberPublicNonVirtuals> const& numberIf)
    {
        // Get a pointer to the object's ISetValue interface and set the value:
        ComPtr<ISetValue> setValueIf;
        numberIf.As(&setValueIf);
    
        setValueIf->SetValue(42);
    
        // Get a pointer to the object's IGetValue interface and get the value:
        ComPtr<IGetValue> getValueIf;
        numberIf.As(&getValueIf);
    
        int value = 0;
        getValueIf->GetValue(&value);
    }

The As member function template of the WRL ComPtr simply encapsulates a call to IUnknown::QueryInterface in a way that helps to prevent common programming errors. We use it first to get the ISetValue interface pointer to call SetValue, then again to get the IGetValue interface pointer to call GetValue.

This would be a lot simpler if we obtained a Number* and called SetValue and GetValue via that pointer. Unfortunately, we can't do that: recall that the implementation of a reference type is opaque and that we only ever interact with an object via a pointer to one of the interfaces that it implements. This means that we can never have a Number*; there is really no such thing. Rather, we can only refer to a Number object via an IGetValue*, an ISetValue*, or an __INumberPublicNonVirtuals*.

This is an awful lot of code just to call two member functions, and this example demonstrates one of the key hurdles that we have to overcome to make it easier to use Windows Runtime types. Unlike COM, the Windows Runtime does not allow an interface to derive from another Windows Runtime interface; all interfaces must derive directly from IInspectable. Each interface is independent, and we can only interact with an object via its interfaces, so if we are using a type that implements multiple interfaces (which many types do), we are stuck writing a lot of rather verbose type conversion code to obtain the correct interface pointer for each function call.

Using an object via C++/CX

One of the key advantages of C++/CX is that the compiler knows which types are Windows Runtime types. It has access to the Windows Metadata (WinMD) file that defines each interface and runtime type, so--among other things--it knows the set of interfaces that each runtime type implements. For instance, the compiler knows that the Number type implements the ISetValue and IGetValue interfaces, because the metadata specifies that it does. The compiler is able to use this type information to automatically generate type conversion code.

Consider the following C++/CX example, which is equivalent to the WRL example that we presented:

    void F(Number^ number)
    {
        ISetValue^ setValueIf = number;
        setValueIf->SetValue(42);
    
        IGetValue^ getValueIf = number;
        int value = getValueIf->GetValue();
    }

Because the compiler knows that the Number type implements the ISetValue and IGetValue interfaces, it allows implicit conversion from Number^ to ISetValue^ and IGetValue^. This implicit conversion causes the compiler to generate a call to IUnknown::QueryInterface to get the right interface pointer. Aside from the cleaner syntax, there's really no magic here: the compiler is just generating the type conversion code that we otherwise would have to write ourselves.

dynamic_cast also works much as we'd expect: we can, for example, amend this example to obtain the IGetValue^ from the ISetValue^:

    void F(Number^ number)
    {
        ISetValue^ setValueIf = number;
        setValueIf->SetValue(42);
    
        IGetValue^ getValueIf = dynamic_cast<IGetValue^>(setValueIf);
        int value = getValueIf->GetValue();
    }

This example has the same behavior as the first example; we just take different steps to get that same behavior. dynamic_cast returns nullptr if the cast fails (though we know that it will succeed in this specific case). C++/CX also provides safe_cast, which throws a Platform::InvalidCastException if the cast fails.
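For example, safe_cast can be used for the same conversion, with the failure path made explicit:

    try
    {
        IGetValue^ getValueIf = safe_cast<IGetValue^>(setValueIf);
        int value = getValueIf->GetValue();
    }
    catch (Platform::InvalidCastException^)
    {
        // The object did not implement IGetValue.
    }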

When we discussed the WRL example above, we noted that there's no such thing as a Number*: we only ever work with interface pointers. This begs the question: what is a Number^? At runtime, a Number^ is a __INumberPublicNonVirtuals^. A hat that refers to a runtime type (and not an interface) actually holds a pointer to the default interface of that runtime type.

At compile-time, though, the compiler treats Number^ as if it refers to the whole Number object. The compiler aggregates all of the members of all of the interfaces implemented by Number and allows all of those members to be called directly via a Number^. We can use a Number^ as if it were an IGetValue^ or an ISetValue^ and the compiler will inject the required calls to QueryInterface to perform the conversions required to make the function call.

Therefore, we can shorten our C++/CX program further:

    void F(Number^ number)
    {
        number->SetValue(42);
        int value = number->GetValue();
    }

This code does exactly the same thing as our first C++/CX example and as our WRL example. There's still no magic: the compiler is simply generating all of the boilerplate to perform the type conversion required to make each function call.

You may have noticed that this example is much shorter and less verbose than the WRL example that we started with. :-) All of the boilerplate is gone and we're left with code that--aside from the ^ and ref that tell the compiler we are dealing with Windows Runtime types--looks exactly like similar C++ code that interacts with ordinary C++ types. This is the point, though: ideally our code that uses Windows Runtime types should be as similar as possible to code that uses C++ types.

Final Notes

Both ComPtr<T> and T^ are "no-overhead" smart pointers: the size of each is the same as the size of an ordinary pointer, and operations using them do not do any unnecessary work. If you need to interoperate between code that uses C++/CX and code that uses WRL, you can simply use reinterpret_cast to convert a T^ to a T*:

    ABI::ISetValue* setValuePtr = reinterpret_cast<ABI::ISetValue*>(setValueIf);

(The ABI level definitions of types are defined in namespaces under the ABI namespace so that they do not conflict with the "high-level" definitions of the types used by C++/CX, which are defined under the global namespace.)

In addition to its type conversion capabilities, the hat provides other benefits that could not otherwise be accomplished via use of an ordinary smart pointer like ComPtr. One of the most important of these benefits is that the hat can be used uniformly, everywhere. A member function that takes an interface pointer as an argument is declared as taking a raw pointer (this is part of the Windows Runtime ABI, which is designed to be simple and language-neutral, and thus knows nothing of what a C++ smart pointer is). So, while one can use ComPtr most places, raw pointers still need to be used at the ABI boundary, and there is room for subtle (and not-so-subtle) programming errors.

With C++/CX, the compiler already transforms member function signatures to translate between exceptions and HRESULTs and the compiler is also able to inject conversions from T^ to T* where required, substantially reducing the opportunity for programming errors.

Casablanca at TechEd Australia


A few days ago, our friends and technology enthusiasts John Azariah and Mahesh Krishnan delivered a great presentation on Casablanca at TechEd Australia. John and Mahesh go deep - PPL tasks, table and blob storage, Metro client, Azure deployment, and of course, lots of great demos.

Enjoy:

http://channel9.msdn.com/Events/TechEd/Australia/2012/AZR331 

In the meantime, the product team in Redmond is preparing a refresh of the DevLabs release. It will include new features, samples, and tutorials. Stay tuned for an announcement on this blog!

Artur Laksberg,
Casablanca Team 

Project Austin Part 1 of 6: Introduction


My name is Jorge Pereira and I am a developer at Microsoft.  For the past few months I've been working on a Windows 8 app along with a small team of developers from the Visual C++ team; we call it Project Code Name Austin.

Austin is a digital note-taking app for Windows 8. You can add pages to your notebook, delete them, or move them around. You can use digital ink to write or draw things on those pages. You can add photos from your computer, from SkyDrive, or directly from your computer's camera. You can share the notes you create with other Windows 8 apps such as e-mail or SkyDrive.

When we sat down to create it, we wanted to build a very simple digital replacement for the real paper notebooks people carry around to meetings at work, to school, and around the house, where they scribble things and take quick notes.

Another very important goal of this app is to showcase the power of the native platform and C++, and some of the new features in Visual Studio 2012 such as automatic code vectorization and C++ AMP.  Austin aims to demonstrate with real code the kind of device-optimized, fluid, and responsive user experience that can be built with our newest native tools on the Windows 8 platform.

For that reason, we are making the majority of the source code available for download here.  We also plan to publish a series of blog posts here in the Visual C++ Team Blog talking about our experience building it, and some of the technologies we used. 

Austin doesn't aspire to be a full-featured note-taking app such as OneNote. It doesn't give you a way to organize your notes other than by their position in the book, and it doesn't enable typing or searching. These were all conscious decisions. We believe in the beautiful simplicity of just a pen and a piece of paper, and that's what we tried to recreate with it.  Much of the inspiration and code for the Austin app draws from an earlier project code-named Courier.

I've been using Austin for a while for many things. For example, I took a photo of a corner in my garden and then drew right over it the vegetable garden box I planned to build. I use it to make my shopping lists and to write cooking recipes and snap photos as I make them. Sometimes, I find it easier to draw something and email it instead of typing a bunch of text. I also use it to write my thoughts when I am coding. 

It's amazing how useful just a pen and a paper are by themselves. But when you take that concept to the computer realm and expand it to do things like add photos and annotate them right on the spot, and digitally share what you create, then the possibilities are endless.

This is a page of notes I wrote while sorting out my thoughts before I rewrote some of Project Austin's camera code: 

And this is a sketch of my vegetable garden:


 
But a video is worth a thousand sketches! We put together this to introduce Austin:

(you can download the video in mp4 format using this link)


Why we built it

We built Austin with two main goals in mind. First, we wanted to build a fully functional real-world app that's actually useful and high quality. Second, we wanted to demonstrate the power of C++ and the Windows 8 platform, and showcase some of the new technologies delivered by our team in Visual Studio 2012, such as C++ AMP and automatic code vectorization. We wanted to use DirectX to create an immersive, fluid user interface that's built as a 3D scene with lights, shadows, and a camera so that pages can be viewed from different angles.

A bit of history to help you understand the origins of Project Austin...

C++

When I started working on Austin, I hadn't written C++ code in over 7 years because I was mostly coding in C#. Since one of the goals of Project Austin was to showcase how to write "modern" C++, I spent quite some time reading through several classic and newer C++ books, trying to get an idea of what proper, "modern" C++ code looks like. (I pulled out my Stroustrup book, which had been on my shelf since school!) I also got great help from some of the experts on the team—after all, there are some obvious advantages to being surrounded by the team of people that build the Visual C++ compiler. The result of this was a set of coding guidelines and style that we used throughout the code.

We use the C++ Standard Library extensively, for strings, collections, and smart pointers. We barely have any “naked” C++ pointers in the code; instead we use smart pointer types such as std::shared_ptr and Microsoft::WRL::ComPtr. We use the RAII pattern extensively; we don't check HRESULTs, and instead do most of our error handling using exceptions. We don't explicitly call new or delete when we create or destroy objects. Our coding conventions are inspired by the Boost libraries.
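For example, the HRESULT-to-exception part of that style can be captured in a small helper like the following sketch (this is illustrative, not Austin's actual helper):

    #include <windows.h>
    #include <stdexcept>
 
    // Turn a failed HRESULT into an exception so call sites stay clean.
    inline void throw_if_failed(HRESULT hr)
    {
        if (FAILED(hr))
        {
            throw std::runtime_error("HRESULT indicated failure");
        }
    }
 
    // Usage: throw_if_failed(device->CreateBuffer(&desc, nullptr, &buffer));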

Of course, I am sure we do lots of silly things, or things we could do better or differently, but we in the team are all pretty happy with the way the code is written and how productive we are. Despite being a part of the C++ team, I am by no means a C++ expert, so this source code isn't meant to dictate the canonical way to do things in C++. It does, however, represent our effort at using the language in the way that works best for us, and in that respect, it's been a great success.

Architecture

Austin's code is structured with common functionality grouped in a library (code named "baja"), following modern modular design principles.  This library started as a basic toolkit that contained facilities such as call-stack walking, tracing, and asserts. Later it grew to provide a graphics engine, a math library, storage, and wrappers around the underlying operating system for things like input and a main app loop.

Austin's code and libraries are a work in progress. It is a toolkit built by a team building a real app, with all its flaws and shortcomings. It's not an "official" Microsoft library in any way. We are giving away this code with no guarantees or promises about it whatsoever—use it at your own risk! We may keep evolving it and adding or changing functionality. 

Technology

Austin is built mostly in C++, and we use a lot of the STL and some Boost. We also use C++/CX to interface with the Windows Runtime, and XAML to display some user interface elements.  The following is a list of what we think are the most interesting parts of Austin.  We'll be posting more detailed articles on most of these in the upcoming weeks.

3D graphics and user interface

Our graphics engine is built on DirectX. The graphics engine gives us a simple scene graph, cameras, lights, a very simple "effects" library with some basic shaders, materials, fonts, and some simple geometry. Overall, it hides a lot of the DirectX lower-level complexity from the rest of the app.

This picture shows a page in Austin from a different camera perspective than normal.  It shows the "viewport" (the blue rectangle), the reference grid, the light source, and the coordinate origin.

As mentioned above, Austin integrates with XAML, which we use to display parts of the user interface such as menu fly-outs, buttons, and so on.  The app uses the SwapChainBackgroundPanel control to render our 3D scene, which includes the notebook's pages, photos, ink strokes, and background. We use XAML for the settings menu, the app bar, and the rest of the user interface.

Storage

We use the Extensible Storage Engine (ESE) as our storage engine. ESE is a fairly low-level storage engine where you manipulate tables and indices rather than use higher-level SQL statements. ESE is used as the storage engine in many Microsoft products.  We found it super-fast and efficient, and it's included in Windows 8.

Ink

We used a home-grown approach to displaying digital ink because we wanted to draw ink on 3D pages, and we wanted absolute control over it so that we could eventually do things like bleed the ink onto the paper and change its lighting based on the paper grain.  The initial implementation of the ink used a geometry shader to generate the mesh, but we ended up moving that code to the CPU when the logic became a bit too complex for my limited HLSL coding skills. So currently, the mesh is generated in the CPU, then the vertex buffers are sent to the GPU for rendering using a pixel shader.

This picture shows our first (geometry shader-based) ink implementations. Notice the mesh that forms the ink stroke: 

Ink smoothing

Another interesting feature of Austin is its ink smoothing. As the user moves the stylus, finger, or mouse pointer over the screen, the points, pressures, and velocities are captured and eventually make it to our app. We use these points to generate the mesh that forms an ink stroke. However, if the user moves the stylus very fast, the distance between these points can be noticeable, and the ink stroke shows straight sections that make it look not that great. In this example you can see the difference, especially around the sharper corners of the stroke. It's subtle but it makes a difference.  This code is also vectorized for an extra performance boost. 

Navigation and scalability

Another important feature we wanted to give Austin is the ability to browse through many pages of the notebook in a fast and scalable way. We put a great deal of effort into this—loading the data from ESE asynchronously in multiple threads using the Parallel Patterns Library (PPL), generating page thumbnail textures, caching some of the view data, and using DirectX deferred rendering to avoid interrupting the UI thread and causing a jarring input experience.

Paper simulation

Austin gives you several ways to navigate your pages. You can view them in a 3-row grid or in a single-row list, or you can view them "stacked" on top of each other. In this third mode, when you swipe across the screen, the page reacts and turns like an actual physical page. Getting this page curling right was also an interesting piece of work. We started by using a physics engine to try to simulate the paper but it ended up looking too much like cloth. Eventually, we wrote some code inspired by [1] that wraps the paper around a "virtual", invisible cone that changes form as the user swipes the finger across the screen. The results of this much simpler approach look great.

This picture shows what a page looks like when curled:


C++ AMP

When the paper mesh is deformed, we need to calculate the normal vectors of its vertices so we can apply the shading properly. We used C++ AMP to write code that runs on the CPU and also on the GPU. (The WARP or reference device runs C++ AMP code when a compatible GPU is not available.) The performance boost was fantastic. We plan to re-write more parts of Austin using C++ AMP. 
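To give a flavor of the programming model, here is a minimal C++ AMP sketch of data-parallel work over an array (illustrative only, not Austin's actual normal-calculation code):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;
 
    // Scale every element in parallel on the GPU (or on WARP when no
    // compatible GPU is available).
    void scale(std::vector<float>& data, float factor)
    {
        array_view<float, 1> view(static_cast<int>(data.size()), data);
        parallel_for_each(view.extent, [=](index<1> idx) restrict(amp)
        {
            view[idx] *= factor;
        });
        view.synchronize(); // copy results back to 'data'
    }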

Shadows and post-processing effects

One last interesting piece of Austin is the way it renders its shadows. I am not a very experienced graphics programmer, so I had to put a lot of effort into making this look good while keeping performance high. The pages cast shadows onto the background, and then we apply some post-processing filters on the GPU to make things even more interesting. The results look great and they are super-fast.

This picture shows a few pages in Austin, casting shadows on the background.  There's a Gaussian blur filter applied on the shadows, as well as a radial darken filter on the entire background to give it more depth.


 
And that's all for now! We will be writing more about Austin and Project Baja because we want to share our experience building this app with everyone.

References

[1] L. Hong, S. K. Card, and J. Chen, "Turning Pages of 3D Electronic Books," in Proc. 3DUI, 2006, pp. 159-165.

Download Today: Refreshed Casablanca Bits Available


Back at the end of April, we announced our first release of Casablanca as an incubation project on DevLabs. Since then, we have been glad to receive a positive response from the C++ community. At the end of June, we refreshed the bits to support Visual Studio 2012 RC and the Windows 8 Release Preview. Those builds are now rather long in the tooth, and many have been asking for a refresh for the most recent builds, namely the final versions of Visual Studio 2012 and Windows 8.

Today, we are announcing that this long awaited refresh is available for download at the Casablanca devlabs site. In addition to refreshing and fixing lots of bugs, we are also adding a few new capabilities:

  • Streaming of HTTP message bodies (there are some limitations on Windows 8 App Store apps).
  • Improved asynchronous stream libraries, including interop with C++ iostreams.
  • Support for Azure Table storage, with some restrictions.
  • Complete support for the Azure blob and queue APIs.
  • Cleaned-up use of strings as arguments to various APIs; there should be no more confusion between wstring and string.
  • A couple of new samples, mostly focusing on Azure storage capabilities.

We’re hoping to make more frequent updates in the future, and will soon put this project on a path toward release as a supported product under an OSS license.

Please use the devlabs forums to provide us with feedback.

Thank you for your support and we look forward to hearing back from you!

The Casablanca Team

DirectX Graphics Development with Visual Studio 2012


Visual Studio 2012 includes several new features for developing and debugging applications that use DirectX.  Here are links to references and resources so you can get started with these new features. 

Getting Started

You can write and build apps that use DirectX with Visual Studio Express 2012 for Windows 8 or Visual Studio Express 2012 for Windows Desktop, or any of the retail versions of Visual Studio 2012 (Professional, Premium, and Ultimate).  You also don’t need a separate DirectX SDK download – the DirectX SDK is now part of the Windows SDK, and the Windows 8 SDK is included in Visual Studio 2012. 

You do need a retail version of Visual Studio 2012 to use the Visual Studio Graphics Debugging and Graphics Asset tools described later in this post.

If you have projects that were using the DirectX SDK, check out “Where is the DirectX SDK?” to learn how to use these projects with Visual Studio 2012.

Samples

There are many DirectX samples on the MSDN Samples Gallery, for both Windows Store and desktop apps.  You can also search for and download samples directly from the New Project window in Visual Studio 2012.

Resources for using DirectX in Windows Store Apps

Not only does Visual Studio include the same DirectX support for the new Windows Store apps as it does for desktop apps, but you can also combine XAML and DirectX in the same Windows Store app.

Compiling and using HLSL files

Visual Studio now includes support for HLSL files in the IDE, including syntax coloring, indenting, and outlining.  We also support using the HLSL compiler (FXC.exe) with MSBuild so you can easily compile your HLSL files into .cso (compiled shader output) format.  You can configure the compiler settings on a per-file basis through Property Pages.
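If you need to compile shaders outside of MSBuild, the HLSL compiler can also be invoked directly from a command prompt; a hypothetical invocation (file names and entry point are illustrative):

    fxc /T ps_5_0 /E main /Fo SimplePixelShader.cso SimplePixelShader.hlsl

Here /T selects the shader profile, /E names the entry point, and /Fo sets the compiled output file.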

Visual Studio DirectX Graphics Diagnostics

The DirectX Graphics Diagnostics tools help you diagnose and debug DirectX rendering issues by analyzing frames captured into a log file.  These tools integrate some of the functionality from the PIX for Windows tool that was part of the DirectX SDK.  To debug apps running on tablets or other devices which don’t have Visual Studio installed, you can capture frames programmatically and then open the logs in Visual Studio to debug after the fact.

Tools for Graphics Assets

As a developer, wouldn’t it be great to look at graphics assets such as textures or 3D models without needing to compile them into your game or app and run it?  Or to see the textures or models while you’re using the Graphics Diagnostics tools to visually understand what’s being (or not being) rendered?  Visual Studio 2012 includes tools that allow you to view graphics assets directly from the IDE.  You can also use the Shader Designer to create shaders using a visual tool, so you can visualize what the shader will do as you’re designing it.

Videos

Feedback

We would love to hear from you about the graphics tools in Visual Studio!  To report bugs, please use the Visual Studio Connect site.  The Visual Studio UserVoice site is the best place to submit suggestions and ideas for future releases.

Still more that you want to know?  Leave us feedback in the comments.

Project Austin Part 2 of 6: Page Curling


Hi, my name is Eric Brumer. I’m a developer on the C++ compiler optimizer, but I’ve spent some time working on Project Code Name Austin to help showcase the power and performance of C++ in a real-world program. For a general overview of the project, please check out the original blog post. The source code for Austin, including the bits specific to page curling described here, can be downloaded on CodePlex.

In this blog post, I’ll explain how we implemented page turning in the “Full page” viewing mode. We wanted flipping through the pages in Austin to feel like flipping through pages in a real book. To that end, we built on some existing published work to achieve performant and realistic page curling.

Before going further, take a look at a video of page curling in action!

(you can download the video in mp4 format using this link)

Realistic page curling

A brilliant paper by Hong et al. called “Turning Pages of 3D Electronic Books” claims that turning a page of a physical book can be simulated by deforming the page around a cone.  See [1] for the details.

Here’s a (poorly drawn) diagram to help explain the concept in the paper. The flat sheet of paper is deformed around the cone to simulate curling. By changing the shape and position of the cone you can simulate more or less curling.

Similarly, you can also curl a flat sheet of paper around a cylinder. Here’s another (poorly drawn) diagram to help explain that concept.

To simulate curling, we use a combination of curling around a cone and curling around a cylinder:

  • If the user is trying to curl from the top-right of the page, we simulate pinching the top right corner of a piece of paper by deforming around a cone.
  • If the user is trying to curl from the center-right of the page, we simulate pinching the center of a piece of paper by deforming around a cylinder.
  • If the user is trying to curl from the bottom-right of the page, we simulate pinching the bottom right corner of a piece of paper by deforming around the cone flipped upside down.

Anywhere in between and we use a combination of cone & cylinder deforming.

Some geometry

Here are the details of how to transform a page around a cylinder; the similar geometry for transforming a page around a cone is described in [1]. Given the point Pflat with coordinates {x1, y1, z1 = 0} on a flat page, we want to transform it into Pcurl with coordinates {x2, y2, z2}, the corresponding point on a cylinder of radius r that is lying on the ‘spine’ of the book. Consider the following diagram. Note the x & z axes (the y axis is in & out of your computer screen). Also keep in mind I am representing the flat paper & cylinder using the same colors as in the diagrams above.

The key insight is that the distance from the origin to Pflat (x1) is the same as the arc distance from the origin to Pcurl along the cylinder. Then, from simple geometry, we can say that β = x1 / r. Now, to get Pcurl, we take the origin, move it down by ‘r’ on the z axis, rotate it by β, then move it up by ‘r’ on the z axis. So, the math ends up being:
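    x2 = r * sin(β)
    y2 = y1
    z2 = r * (1 - cos(β))

(These equations follow directly from the three steps above and match the cylinder code shown below; the y coordinate is unchanged by the cylinder wrap.)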

The above equations compute Pcurl by wrapping a flat page around a cylinder.  [1] contains the equations that compute a different Pcurl by wrapping a flat page around a cone. Once we compute both Pcurl values, we combine the results based on where the user is trying to curl the page. Lastly, after we have computed the two curled points, we rotate the entire page about the spine of the book.

The specific parameters are tuned by hand: the cone parameters, the cylinder width, and the rotation about the spine of the book.

Code

The source code for Austin, including the bits specific to page curling described here, can be downloaded on CodePlex. The page curling transformation is done in journal/views/page_curl.cpp, specifically in page_curl::curlPage(). The rest of the code in that file is to handle uncurling pages (forwards or backwards) when the user lifts their finger off the screen. I'm omitting some important details, but this code gives the rough idea.

    for (b::int32 j = 0; j < jMax; j++)
    {
        ...

        for (b::int32 i = 0; i < iMax; i++)
        {
            {load up x, y, z=0}

            float coneX = x;
            float coneY = y;
            float coneZ = z;
            {
                // Compute conical parameters coneX, coneY, coneZ
                ...
            }

            float cylX = x;
            float cylY = y;
            float cylZ = z;
            {
                float beta = cylX / cylRadius;

                // Rotate (0,0,0) by beta around line given by x = 0, z = cylRadius.
                // aka Rotate (0,0,-cylRadius) by beta, then add cylRadius back to z coordinate
                cylZ = -cylRadius;
                cylX = -cylZ * sin(beta);
                cylZ = cylZ * cos(beta);
                cylZ += cylRadius;

                // Then rotate by 'angle' about the y axis.
                // Use a temporary so the second line sees the pre-rotation x.
                float rotX = cylX;
                cylX = rotX * cos(angle) - cylZ * sin(angle);
                cylZ = rotX * sin(angle) + cylZ * cos(angle);

                // Transform coordinates to the page
                cylX = (cylX * pageCoordTransform) - pageMaxX;
                cylY = (-cylY * pageCoordTransform) + pageMaxY;
                cylZ = cylZ * pageCoordTransform;
            }

            // combine cone & cylinder systems
            x = conicContribution * coneX + (1-conicContribution) * cylX;
            y = conicContribution * coneY + (1-conicContribution) * cylY;
            z = conicContribution * coneZ + (1-conicContribution) * cylZ;

            vertexBuffer[jOffset + i].position.x = x;
            vertexBuffer[jOffset + i].position.y = y;
            vertexBuffer[jOffset + i].position.z = z;
        }
    }

Automatic Vectorization

A new feature in the Visual Studio 2012 C++ compiler is automatic vectorization. The C++ compiler analyzes loop bodies and generates code targeting the SSE2 instruction set to take advantage of CPU vector units. For an introduction to the auto vectorizer, and plenty of other information, please see the vectorizer blog series.

The inner loop above is vectorized by the Visual Studio 2012 C++ compiler. The compiler is able to vectorize all of the transcendental functions in math.h, along with the standard arithmetic operations (addition, multiplication, etc.). The generated code loads four values of x, y, and z at a time, computes four cone contributions and four cylinder contributions, combines them, and stores the results into the vertex buffer for four vertices.

I know the code gets vectorized because I specified the /Qvec-report:1 option in my project settings, under Configuration Properties -> C/C++ -> Command Line, as per the following picture:
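(The switch can also be passed directly when compiling from a developer command prompt; an illustrative invocation:)

    cl /O2 /Qvec-report:1 page_curl.cpp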

Then, after compiling, the output window shows which loops were vectorized, as per the following picture:

Eric's editorial: we decided late during the product cycle to include the /Qvec-report:1 and /Qvec-report:2 switches, and we did not have time to include them in the proper menu location.

If you do not see a loop getting vectorized and wonder why, you can specify the /Qvec-report:2 option. We offer some guidance on handling loops that are not vectorized in a vectorizer blog post.

Because of the power of CPU vector units, the 'i' loop gets sped up by a factor of 1.75. In this instance, we are able to compute Pcurl (the combination of cone & cylinder) for four vertices at a time. This frees up CPU time for other rendering tasks, such as shading the page.

Performance

To curl a single page, we need to calculate Pcurl for each vertex comprising a piece of paper. By my count, this involves 4 calls to sin, 3 calls to cos, 1 arcsin, 1 sqrt, and a dozen or so multiplications, additions, and subtractions – for each vertex in a piece of paper – for each frame that we are rendering!

We aim to render at 60 fps, which means we have around 15 milliseconds to curl the page's vertices and render them -- otherwise the app will feel sluggish. With this loop getting auto-vectorized, we're able to free up CPU time for other rendering tasks, such as shading the page.

References

[1] L. Hong, S. K. Card, and J. Chen, "Turning Pages of 3D Electronic Books," in Proc. 3DUI, 2006, pp. 159-165.

C++ Runtime for Windows 8 Store apps


Background

If you have shipped software built using Visual C++, you have probably had to think about deploying the C++ Runtime DLLs.  If your binaries dynamically link to the C++ libraries, then your desktop apps probably deploy the C++ Runtime using VCRedist or merge modules, or by copying the C++ Runtime DLLs alongside your own binaries.  In this blog post, we are going to look at how this problem has been addressed for Windows 8 Store apps that are written entirely using C++ or that contain some components written using C++.

Windows 8 App packages and deployment

Windows 8 has reimagined the deployment model for Store apps.  As a developer, you don't write routines to install or uninstall your Windows Store app. Instead, you package your app and submit it to the Windows Store. Users acquire your app from the Windows Store as an app package. Windows uses information in an app package to install the app on a per-user basis, and ensures that all traces of the app are gone from the device after all users who installed the app uninstall it.  There are more details about this new model here.

Having a uniform deployment model simplifies the lives of app developers, and having a single trusted source of apps in the form of the Store provides greater confidence to end-users.  This in turn helps the ecosystem.

C++ Runtime package

Usually an app package is a fully contained, self-sufficient unit of deployment that contains all of your app's binaries, resources, assets, and so on.  But there are times when you need to express dependencies on external components such as the C++ Runtime DLLs.  In order to provide this functionality, we have created a special package called the C++ Runtime package.  This package is special in the sense that it is a Windows Component Library which other packages can depend on and which can be shared by multiple packages.  It contains all the C++ Runtime DLLs relevant for Windows Store apps.  If an app package specifies a dependency on the C++ Runtime package, then at runtime the app is able to load the C++ Runtime DLLs from the dependency package.

If you use Visual Studio 2012 to create your C++ app package, then VS automatically introduces a reference into your app’s AppXManifest.xml which basically expresses a dependency on the C++ Runtime package.

The C++ Runtime packages are already on the Store.  So when you upload your C++ app (with a C++ Runtime dependency) to the Store, the Store is smart enough to associate your app with the latest C++ Runtime package version.  Now whenever an end-user downloads your app from the Store, they also get the C++ Runtime package along with your app.  The dependency package is downloaded only if it is not already present on the end-user’s machine or the version number on the end-user’s machine is less than the latest dependency package on the Store.

C++ Runtime Extension SDK and Dependencies

In order to express C++ Runtime dependency for C++ apps and to mimic the same runtime behavior on developer’s machines as would be seen on the end-user’s machines, we make use of the new Visual Studio Extension SDK mechanism.

If you look under the folder “Program Files (x86)\Microsoft SDKs\Windows\v8.0\ExtensionSDKs” on your machine, you will see a list of SDKs that your Windows Store apps can take advantage of using the Visual Studio “Add Reference” feature.  For example, if I create a C++ Windows 8 Store project in VS, and invoke the “Add Reference” dialog, here is what it looks like on my machine:

You will notice that although the above-mentioned folder contains an entry called “Microsoft.VCLibs” (which is basically the Extension SDK for the C++ Runtime), it is not listed in the “Add Reference” dialog.  This is because when you build a C++ project, VS automatically inserts a reference from your project to the Microsoft.VCLibs SDK.

So what really happens as a result of adding this SDK reference?

A couple of things:

1) When you build your app, the AppxManifest.xml file (which describes the properties of your package to Windows) automatically gets a dependency on the C++ Runtime package. If you look at the AppxManifest.xml file for your app package, you will see a section like the one below:

<Dependencies>
    <PackageDependency Name="Microsoft.VCLibs.110.00" MinVersion="11.0.50727.1" />
</Dependencies>  

This basically means your app package now requires that before it installs, a package with the name of Microsoft.VCLibs.110.00 (the C++ Runtime package) and at least a version of 11.0.50727.1 (the Visual Studio 2012 RTM version) must also be installed on the machine.

 2) If you were installing this app from the Store, then as described earlier, Store will automatically push down the dependency package also.  However, when you just want to debug the app on your developer machine, Visual Studio sees that your project has a reference to the Microsoft.VCLibs SDK so it knows that at runtime (when you hit F5 to run your app), it needs to deploy the C++ Runtime package (found at “Program Files (x86)\Microsoft SDKs\Windows\v8.0\ExtensionSDKs\Microsoft.VCLibs\11.0\AppX\Retail\x86”) along with your app.  This way the runtime behavior is similar to what would be seen on an end-user’s machine.

If you examine the contents of the folder “Program Files (x86)\Microsoft SDKs\Windows\v8.0\ExtensionSDKs\Microsoft.VCLibs\11.0\AppX”, you will notice that it contains both debug and release packages for all architectures.  Depending on the configuration of your project (debug or release), VS inserts the dependency on the appropriate C++ Runtime package (debug or release).  The debug packages are only meant for debugging purposes (used at F5) and are not uploaded to the Store, which means that any app package expressing a dependency on the debug C++ Runtime package will not be accepted during Store submission.

Non-C++ apps and C++ Runtime Package

A great thing about Windows 8 Store apps is that it is really easy to build hybrid apps in which different components of the apps can be written in different languages and can talk to each other easily using the Windows Runtime technology (see here for an example). 

Let’s say you create a Visual Studio project for building a Windows Store app using .NET or JavaScript. Now in the same solution you can add a C++ component project to perform some computation intensive job.  When you add a reference from the main app project to the C++ component project, VS automatically detects that your overall app depends on a C++ component (which at runtime will need the C++ Runtime DLLs). So it inserts a dependency on the C++ Runtime package inside your app.

There are some scenarios in which you are writing an app using .NET or JavaScript and you want to use a component written using C++.  However, you don’t have a Visual Studio project for this C++ component, but only the binaries and metadata which you will add to your app package manually.  But these C++ component binaries will need the C++ Runtime DLLs.  In such scenarios, you can directly add a reference from the main app project (.NET or JavaScript) to the C++ Runtime Package using the “Add Reference” dialog as shown below:

 As you can see, in this case the C++ Runtime Package is listed as an option since it is not automatically referenced by the HTML/JavaScript app.

You will probably notice that in most cases, you as a Windows 8 Store app developer rarely have to think about the C++ Runtime.  It is mostly handled for you by Visual Studio and by the Store.  As always, we are happy to receive any feedback you might have about the above solutions.  We are always looking to improve both the libraries' functionality and the deployment process.

Thank you

Raman Sharma


Project Austin Part 3 of 6: Ink Smoothing


Hi, my name is Eric Brumer. I’m a developer on the C++ compiler optimizer, but I’ve spent some time working on Project Code Name Austin to help showcase the power and performance of C++ in a real world program. For a general overview of the project, please check out the introduction blog post.

This blog post describes how we perform ink smoothing.

Consider a straightforward ink drawing mechanism: draw straight lines between each stylus input point that is sampled. The devices and drivers we have been using on Windows 8 sample 120 input points per second. This may seem like a lot, but very swift strokes can sometimes cause visible straight edges. Here’s a sample from the app (without ink smoothing) which shows some straight edges:

Here is the same set of ink strokes, but with the ink stroke smoothed.


Spline

We are using a spline technique to do real-time ink smoothing. Other options were considered, but the spline (a) can be computed in real time, so the strokes you draw are always smooth as new input points are sampled, and (b) is computationally feasible.

There is plenty of literature online about spline smoothing techniques, but in my (limited) research I have either found descriptions that are too simplistic, or descriptions that require a degree in computer graphics to understand. So here’s my shot at something in the middle...

Before computers, a technique was used to create smoothed curves using a tool called a spline. This was a flexible material (heavy rope, a flexible piece of wood, etc) that could bend into shape, but also be fixed at certain locations along its body. For example, you could take a piece of heavy rope, pin the rope to a wall using a bunch of pins in different locations along the rope, then trace the outline of the bendy rope to yield a spline-smoothed curve.

Fast forward several decades and now we are using the same principles to create a smoothed line between a set of points. Say we have a line with many points P0, P1, P2, … To smooth it using a spline, we take the first 4 points (P0, P1, P2, P3) and draw a smooth curve that passes through P1 & P2. Then we move the window of 4 points to (P1, P2, P3, P4) and draw a smooth curve that passes through P2 & P3. Rinse and repeat for the entire curve. The reason it’s a spline technique is that we consider the two points as being ‘pinned’, just like pinning some rope to a wall.

Before going into how we draw the smoothed line between those points, let’s examine the benefits:

  1. We only need four points to draw a smoothed line between the middle two. As you are drawing an ink stroke with your stylus, we are constantly able to smooth the stroke. I.e. we can do real time smoothing.
  2. The computation is bounded, and by some neat compiler optimizations and limiting the number of samples when drawing the smoothed line (see item 2 below) we can ensure ink smoothing won’t be on the critical path of performance.

There are a few things to keep in mind:

  1. We need to handle drawing a smoothed line between the first two points (P0 & P1), as well as drawing the smoothed line between the last two points on the curve. I do these by faking up those points and applying the same spline technique.
  2. I keep writing “draw a smoothed line between two points”. We can’t draw a smoothed line; we can only draw a bunch of straight lines that look smooth. So when I say “draw a smoothed line between two points” what I mean to say is “draw many straight lines that look smooth which connect two points”. We just sample points along the curved line at regular intervals which are known to look smooth at the pixel level.
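Putting the sliding window (and the endpoint trick from item 1) together, the driver loop looks roughly like the following sketch; smoothSegment is a hypothetical stand-in for the cardinal spline evaluation shown in the next section:

    #include <vector>
 
    struct Point { float x, y; };
 
    // Hypothetical: appends the sampled smoothed curve between p1 and p2
    // to 'out', using p0 and p3 as the neighboring points.
    void smoothSegment(const Point& p0, const Point& p1,
                       const Point& p2, const Point& p3,
                       std::vector<Point>& out);
 
    void smoothStroke(const std::vector<Point>& pts, std::vector<Point>& out)
    {
        if (pts.size() < 2)
            return;
 
        for (size_t i = 0; i + 1 < pts.size(); ++i)
        {
            // Fake up the missing neighbors at the ends of the stroke
            // by reusing the endpoints themselves.
            const Point& p0 = (i == 0) ? pts.front() : pts[i - 1];
            const Point& p3 = (i + 2 < pts.size()) ? pts[i + 2] : pts.back();
            smoothSegment(p0, pts[i], pts[i + 1], p3, out);
        }
    }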

Cubic Spline & Cardinal Spline

Now on to the mathematical meat… When a graphics person says that a line is smooth at a given point, what they are saying is that the line is contiguous at that point, the first derivative of the line is contiguous at that point, and the second derivative is contiguous at that point. Apologies if I’m bringing back horrible memories of high school or college calculus.

Here’s a visual of five points with the smoothed line already drawn in blue.

We can define each segment of the smoothed blue curve as being parameterized by a parameter “t” which goes from 0 to 1. So the blue line is the concatenation of 4 curves given by:

P01(t) where t ranges from 0 to 1 for the first segment (from P0 to P1)
P12(t) where t ranges from 0 to 1 for the second segment (from P1 to P2)
… etc …

Using the ` character to mean derivative, applying the definition of smooth at the endpoints of each of the segments yields a bunch of equations:

P01(t=1) = P12(t=0)                         P`01(t=1) = P`12(t=0)                      P``01(t=1) = P``12(t=0)
P12(t=1) = P23(t=0)                         P`12(t=1) = P`23(t=0)                      P``12(t=1) = P``23(t=0)
        … etc …

Solving those equations exactly is trying. See spline interpolation. In general, if you are looking for a polynomial to satisfy an equation with second derivatives, you are shopping for a polynomial of degree 3, aka a cubic polynomial. Hence the ‘cubic’ in cubic spline.

The Wikipedia page shows a solution that fits the smoothness equations, but a lot of work has been done in this space to come up with more computationally feasible solutions that look just as smooth. Basically, we relax the second-derivative constraints and say P``01(t=1) ~= P``12(t=0), etc. This opens up many possibilities – look up any cubic spline and you’ll see many options.

After much experimenting, I found that the Cardinal spline works best for our ink strokes. The cardinal spline solution for the smoothed curve between 4 points P0, P1, P2, P3 is as follows:
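With t running from 0 to 1 between the pinned points P1 and P2, the curve is (this matches the C++ code below):

    P12(t) = (2t^3 - 3t^2 + 1) * P1
           + (-2t^3 + 3t^2)    * P2
           + (t^3 - 2t^2 + t)  * L * (P2 - P0)
           + (t^3 - t^2)       * L * (P3 - P1)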

The factor L is used to simulate the “tension in the heavy rope”, and can be tuned as you see fit. We chose a value around 0.5. If you are so inclined, you can also write out P23(t), take a bunch of derivatives and see this fits the smoothness equations. If you are a high school calculus teacher, please don’t make your students do this for homework.

The formula can be expressed in C++:

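    // Note: p1..p4 here correspond to P0..P3 in the text above; the curve
    // is evaluated between the pinned points p2 and p3.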
    for (int i=0; i<numPoints; i++)
    {
        float t = (float)i/(float)(numPoints-1);
        smoothedPoints_X[i] =     (2*t*t*t - 3*t*t + 1)  * p2x
                                + (-2*t*t*t + 3*t*t)     * p3x
                                + (t*t*t - 2*t*t + t)    * L*(p3x-p1x)
                                + (t*t*t - t*t)          * L*(p4x-p2x);

        smoothedPoints_Y[i] =     (2*t*t*t - 3*t*t + 1)  * p2y
                                + (-2*t*t*t + 3*t*t)     * p3y
                                + (t*t*t - 2*t*t + t)    * L*(p3y-p1y)
                                + (t*t*t - t*t)          * L*(p4y-p2y);
    }

numPoints (the number of points to sample on our smoothed line) is based on the minimum interval for what we thought looked good.

Performance

Like I mentioned before, we do real-time ink smoothing. That is to say an ink stroke is smoothed as it is drawn. We need to make sure that drawing a smooth line does not take too long otherwise we’ll notice a drop in frame rate where the ink stroke lags behind your stylus.

One of the benefits of writing this app in C++ is the opportunity for compiler optimizations to kick in. In this particular case, the cardinal spline equations are auto-vectorized by the Visual Studio 2012 C++ compiler. This yields a 30% performance boost when smoothing ink strokes, ensuring we can smooth ink points as fast as Windows can sample them. Also, any extra computing time saved lets us (a) do more computations to make the app better, or (b) finish our computations early, putting the app to sleep and thus saving power.

Read all about the auto vectorizer here: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/12/auto-vectorizer-in-visual-studio-11-overview.aspx

C++/CX Part 3 of [n]: Under Construction


See C++/CX Part 0 of [n]: An Introduction for an introduction to this series and a table of contents with links to each article in the series.

In this article, we'll take a look at the how runtime classes are constructed. We'll use the following Widget runtime class throughout this article:

    public ref class Widget sealed
    {
    public:
    
        Widget()           : _number(0)      { }
        Widget(int number) : _number(number) { }
    
        int GetNumber() { return _number; }
    
    private:
        int _number;
    };

This type has both a default constructor and a constructor with an int parameter. C++/CX runtime class constructors are largely the same as constructors for ordinary C++ class types. Like ordinary member functions, any constructor that is part of the public interface of the runtime class can only use Windows Runtime types in its signature. This rule applies to public and protected constructors of public runtime classes, because these form the interface of the runtime class. Otherwise, there isn't much more to say about runtime class constructors.

C++/CX adds a new operator, ref new, that is used to construct an instance of a runtime class. For example, we can easily construct a Widget instance using either of its constructors:

    Widget^ widgetZero   = ref new Widget();
    Widget^ widgetAnswer = ref new Widget(42); 

The behavior of ref new is comparable to that of new: it takes the runtime class to be constructed and a set of arguments to be passed to the constructor of that runtime class, and it constructs an instance of that type. Whereas new T() yields a T* pointing to the new object, ref new T() yields a T^. In this respect, ref new is similar to the make_shared helper function that can be used to safely construct a shared_ptr.

Much as we saw in the previous articles, aside from the syntactic tags that tell the compiler that Widget is a Windows Runtime type (e.g., ^ and ref), this code looks almost exactly like equivalent C++ code that works with ordinary C++ types. Constructors are declared the same way, and ref new is largely used in the same way as new is used. However, this syntactic simplicity hides quite a bit of complex machinery, which is what we're going to investigate here.

Because C++/CX hides all of the complexity here, we'll instead use WRL to explain how object construction works. So that we have somewhere to start, we'll translate our Widget type into WRL. First the IDL that declares the Widget runtime type and its default interface, IWidget:

    [exclusiveto(Widget)]
    [uuid(ada06666-5abd-4691-8a44-56703e020d64)]
    [version(1.0)]
    interface IWidget : IInspectable
    {
        HRESULT GetNumber([out] [retval] int* number);
    }
     
    [version(1.0)]
    runtimeclass Widget
    {
        [default] interface IWidget;
    }

And the C++ definition of the Widget type:

    class Widget : public RuntimeClass<IWidget>
    {
        InspectableClass(RuntimeClass_WRLWidgetComponent_Widget, BaseTrust)
    
    public:
    
        Widget()           : _number(0)      { }
        Widget(int number) : _number(number) { }
    
        STDMETHODIMP GetNumber(int* number) { *number = _number; return S_OK; }
    
    private:
    
        INT32 _number;
    };

Please note: for brevity, error handling code has been omitted from most of the examples in this article. When writing real code, please be sure to handle error conditions, including null pointers and failed HRESULTs.

While this C++ Widget class defines two constructors, these constructors are implementation details of the Widget type. Recall from Part 1 that we only ever interact with a runtime class object via an interface pointer: since the constructors are not declared by any interface, they are not part of the public interface of the runtime class.

Where Do Widgets Come From?

The structure of a runtime class is an implementation detail of the particular language and framework that are used to implement the runtime class; thus, the way in which an object of such a type is constructed is also an implementation detail, since construction is inexorably linked to the structure of the type. Windows Runtime components are consumable from any language that supports the Windows Runtime, so we need a language-neutral mechanism for constructing runtime class objects.

The Windows Runtime uses activation factories for constructing runtime class objects. An activation factory is a runtime class whose purpose is to construct objects of a particular runtime class type. We will define a WidgetFactory activation factory that constructs Widget objects.

Like any other runtime class, an activation factory implements a set of interfaces. Every activation factory must implement the IActivationFactory interface, which declares a single member function: ActivateInstance. The ActivateInstance interface function takes no arguments and returns a default-constructed object. An activation factory can also implement user-defined factory interfaces that define other "construction" functions. For our WidgetFactory, we'll use the following factory interface:

    [exclusiveto(Widget)]
    [uuid(5b197688-2f57-4d01-92cd-a888f10dcd90)]
    [version(1.0)]
    interface IWidgetFactory : IInspectable
    {
        HRESULT CreateInstance([in] int value, [out] [retval] Widget** widget);
    }

A factory interface may only declare factory functions: each function must take one or more arguments and must return an instance of the runtime class. The Widget runtime class has only one non-default constructor, so we only need to declare a single factory function here, but it's possible to define as many factory functions as are needed. When using C++/CX, the compiler automatically generates a factory interface for each public ref class, with factory functions whose arguments correspond to those of each of the constructors of the ref class.

In addition to defining a factory interface for the Widget type, we also need to annotate it in the IDL as being activatable. We do so using the activatable attribute, which has two forms, both of which we will use for our Widget type:

    [activatable(1.0)]
    [activatable(IWidgetFactory, 1.0)]

The first form declares the type as being default constructible. The second form declares that the IWidgetFactory is a factory interface for the runtime class. (The 1.0 in each is a version number; it is not relevant for this discussion.) When the midlrt compiler compiles the IDL file into a Windows Metadata (WinMD) file, it will use these attributes to add the correct set of constructors to the metadata for the runtime class.

Next, we need to implement a WidgetFactory type that implements both the IActivationFactory and the IWidgetFactory interfaces. Instead of using the WRL RuntimeClass base class template, we'll use the ActivationFactory base class template, which is designed to support activation factories.

    class WidgetFactory : public ActivationFactory<IWidgetFactory>
    {
        InspectableClassStatic(RuntimeClass_WRLWidgetComponent_Widget, BaseTrust)
    
    public:
    
        STDMETHODIMP ActivateInstance(IInspectable** widget) override
        {
            *widget = Make<Widget>().Detach();
            return *widget != nullptr ? S_OK : E_OUTOFMEMORY;
        }
    
        STDMETHODIMP CreateInstance(int value, IWidget** widget) override
        {
            *widget = Make<Widget>(value).Detach();
            return *widget != nullptr ? S_OK : E_OUTOFMEMORY;
        }
    };

ActivationFactory provides a default implementation of the IActivationFactory interface; this default implementation simply defines ActivateInstance as returning E_NOTIMPL. This is suitable for runtime classes that are not default constructible; for runtime classes that are default constructible (like Widget), we need to override ActivateInstance to actually default construct an object.
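
For reference, IActivationFactory itself declares just this one member; sketched here in IDL (the real definition ships with the Windows SDK, and its attributes are omitted):

    interface IActivationFactory : IInspectable
    {
        HRESULT ActivateInstance([out] [retval] IInspectable** instance);
    }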

Make<Widget>() is effectively equivalent to new (nothrow) Widget(): it dynamically allocates memory for a Widget and passes the provided arguments to the Widget constructor. Like new (nothrow), it yields nullptr if allocation fails (remember, we can't throw an exception from a function implementing an interface, we must return an HRESULT). It returns a ComPtr<Widget>; since we are returning the interface pointer, we simply detach the pointer and return it (the caller is responsible for calling Release on all returned interface pointers).
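
To illustrate what the omitted error handling might look like, here is a more defensive version of CreateInstance. This is just a sketch, not what the tools generate:

    STDMETHODIMP CreateInstance(int value, IWidget** widget) override
    {
        if (widget == nullptr)
            return E_POINTER;
        *widget = nullptr;

        ComPtr<Widget> instance = Make<Widget>(value);
        if (instance == nullptr)
            return E_OUTOFMEMORY;

        // CopyTo AddRefs and hands out the requested interface pointer:
        return instance.CopyTo(widget);
    }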

That's all we need to implement the WidgetFactory activation factory. If we can get an instance of the factory, we can easily create Widget objects. For example,

    void Test(ComPtr<IWidgetFactory> const& factory)
    {
        ComPtr<IWidget> widget;
        factory->CreateInstance(42, widget.GetAddressOf());
	
        // Hooray, we have a widget!
    } 

Where Do Widget Factories Come From?

To enable construction of Widget objects, we built a WidgetFactory, so to enable construction of WidgetFactory objects, we'll build a WidgetFactoryFactory. Then, to enable construction of those... Ha! Just kidding. ;-)

Each activatable runtime class is defined in a module (DLL). Each module that defines one or more activatable runtime classes must export an entry point named DllGetActivationFactory. It is declared as follows:

    HRESULT WINAPI DllGetActivationFactory(HSTRING              activatableClassId,
                                           IActivationFactory** factory);

This function is, in a sense, a factory for activation factories: it takes as an argument the name of a runtime class (activatableClassId) and it returns via the out parameter factory an instance of the activation factory for the named type. If the module does not have an activation factory for the named type, it returns a failure error code. (Aside: HSTRING is the Windows Runtime string type, which we'll discuss in a future article.)

Conceptually, we can think of the function as being implemented like so:

    HRESULT WINAPI DllGetActivationFactory(HSTRING              activatableClassId,
                                           IActivationFactory** factory)
    {
        // Convert the HSTRING to a C string for easier comparison:
        wchar_t const* className = WindowsGetStringRawBuffer(activatableClassId, nullptr);
    
        // Are we being asked for the Widget factory?  If so, return an instance:
        if (wcscmp(className, L"WidgetComponent.Widget") == 0)
        {
            *factory = Make<WidgetFactory>().Detach();
            return S_OK;
        }
    
        // If our module defines other activatable types, we'd check for them here.
    
        // Otherwise, we return that we failed to satisfy the request:
        *factory = nullptr;
        return E_NOINTERFACE;
    }

In practice, we should never have to do much work to implement this function. When C++/CX is used to build a component, the compiler will implement this function automatically if the _WINRT_DLL macro is defined (this macro is defined by default in the Windows Runtime Component project template in Visual Studio). With WRL, a bit of work is required, but it's quite straightforward. Each activatable class must be registered with WRL using one of the ActivatableClass macros. For example, to register our Widget type with its WidgetFactory activation factory, we can use the ActivatableClassWithFactory macro:

    ActivatableClassWithFactory(Widget, WidgetFactory)

Because many types only permit default construction, and because default construction makes use of the IActivationFactory interface and doesn't require any custom, type-specific logic, WRL also provides a helpful form of this macro, ActivatableClass. This macro generates a simple activation factory that allows default construction, and registers the generated activation factory. We used this macro in Part 1 when we translated the Number class from C++/CX into WRL.

If all of the activatable runtime classes are registered with WRL, we can simply have DllGetActivationFactory delegate to WRL and let WRL do all of the hard work.

    HRESULT WINAPI DllGetActivationFactory(HSTRING              activatableClassId,
                                           IActivationFactory** factory)
    {
        auto& module = Microsoft::WRL::Module<Microsoft::WRL::InProc>::GetModule();
        return module.GetActivationFactory(activatableClassId, factory);
    }

At this point, we have everything that we need to make a runtime class constructible: we have a factory that can construct instances of our runtime class and we have a well-defined way to obtain the factory for any activatable runtime class, so long as we know the module in which that runtime class is defined.

Creating an Instance

We've finished implementing our activatable runtime class; now let's take a look at ref new and what happens when we create a Widget instance. At the beginning of this article, we started off with the following:

    Widget^ widget = ref new Widget(42); 

We can translate this into the following C++ code that uses WRL instead of C++/CX:

    HStringReference classId(RuntimeClass_WidgetComponent_Widget);
    
    ComPtr<IWidgetFactory> factory;
    RoGetActivationFactory(
        classId.Get(),
        __uuidof(IWidgetFactory),
        reinterpret_cast<void**>(factory.GetAddressOf()));
    
    ComPtr<IWidget> widget;
    factory->CreateInstance(42, widget.GetAddressOf());

Instantiation is a two-step process: first we need to get the activation factory for the Widget type, then we can construct a Widget instance using that factory. These two steps are quite clear in the WRL code. RoGetActivationFactory is a part of the Windows Runtime itself. It:

  1. finds the module that defines the named runtime type,
  2. loads the module (if it hasn't already been loaded),
  3. obtains a pointer to the module's DllGetActivationFactory entry point,
  4. calls that DllGetActivationFactory function to get an instance of the activation factory,
  5. calls QueryInterface on the factory to get a pointer to the requested interface, and
  6. returns the resulting interface pointer.

Most of these are straightforward and require no further comment. The exception is the first item: how, exactly, does the Windows Runtime determine which module to load to instantiate a particular type? While there is a requirement that the metadata for a type be defined in a WinMD file whose name is similar to the type name, there is no such requirement for the naming of modules: the Widget type may be defined in any module.

Every Windows Store app contains a file named AppXManifest.xml. This manifest contains all sorts of important information about the app, including its identity, name, and logo. The manifest also contains a section containing extensions: this section contains a list of all of the modules that define activatable types and a list of all of the activatable types defined by each of those modules. For example, the following entry is similar to what we would find for the Widget type:

    <Extension Category="windows.activatableClass.inProcessServer">
      <InProcessServer>
        <Path>WidgetComponent.dll</Path>
        <ActivatableClass ActivatableClassId="WidgetComponent.Widget" ThreadingModel="both" />
      </InProcessServer>
    </Extension>

The list includes only types defined by modules contained in the app package; types provided by Windows (i.e., types in the Windows namespaces) are registered globally, in the registry, and are not included in the AppXManifest.xml manifest.

For most projects, this manifest is created as part of the app packaging task that runs after an app is built. The contents of the extensions section is automatically populated via examination of the WinMD files for any referenced components and the manifests from any referenced Extension SDKs. When our app calls RoGetActivationFactory, the Windows Runtime uses this list to find the module it needs to load for the Widget type.

It should be noted that, for performance, activation factories may be cached on both sides of the ABI boundary: our component that defines the Widget type really only needs to create a single instance of the WidgetFactory; it doesn't need to create a new instance every time it is asked for the factory. Similarly, our app can cache the factory it got back from RoGetActivationFactory to avoid having to round-trip through the runtime every time it needs to construct a Widget. If our app creates lots of Widgets, this may make a huge difference. Both WRL and C++/CX are pretty smart with respect to this caching.
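
To illustrate the caller-side caching, here's one way an app might hold on to the factory. This is a minimal sketch that assumes a single-threaded caller; production code would need thread-safe initialization and error handling:

    ComPtr<IWidgetFactory> const& GetWidgetFactory()
    {
        static ComPtr<IWidgetFactory> factory;
        if (factory == nullptr)
        {
            HStringReference classId(RuntimeClass_WidgetComponent_Widget);
            RoGetActivationFactory(
                classId.Get(),
                __uuidof(IWidgetFactory),
                reinterpret_cast<void**>(factory.GetAddressOf()));
        }
        return factory;
    }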

In Conclusion

It suffices to say that the C++/CX syntax hides a substantial amount of complexity! We started off with what were effectively four lines of C++/CX: two lines to declare the constructors, and two statements to demonstrate use of those constructors. We have ended up with, well, an awful lot more than that. For the two constructors, we have a Widget-specific factory interface, an activation factory that implements that interface and the IActivationFactory interface, and a module entry point that creates factories. For the ref new expressions, we have a round-trip through the Windows Runtime infrastructure.

Note that everything described here applies to the general case of defining a constructible type in one module and instantiating that type from another module. That is, this is the mechanism for constructing objects through the ABI. If the type is defined in the same module as the code that is instantiating the type, the compiler is able to avoid much of the overhead that is required for calls across the ABI boundary.  A future article will discuss how things work within a single module, but know that things are often much simpler in that case.

If you've found these articles useful, we'd love to hear from you in the comments!  Let us know if you have any questions, ideas, or requests, either about this series of articles or about C++/CX in general.  You can also follow me on Twitter (@JamesMcNellis), where I tweet about a wide range of C++-related topics.

CTP of Windows XP Targeting with C++ in Visual Studio 2012


Background

In June, we announced enhanced targeting for Windows XP using the Visual Studio 2012 C++ compiler and libraries. This feature has been included in Visual Studio 2012 Update 1 CTP 3. You can download it from here. Today, we would like to give an overview of the Windows XP targeting experience, the level of C++ runtime support, and noteworthy differences from the default experience shipped in Visual Studio 2012 at RTM.

Windows XP Targeting Experience

In order to target Windows XP, switch from the default v110 toolset to the newly introduced v110_xp toolset inside your project’s property pages. This new platform toolset points to a repackaged version of the Windows 7 SDK shipped in Visual Studio 2010 instead of the Windows 8 SDK, but uses the same Visual Studio 2012 compiler. The v110_xp toolset also sets useful defaults such as a compatible linker subsystem version for downlevel targeting. Only executables built with this platform toolset are supported to run on Windows XP, but those same executables will also run on Vista, Windows 7 and Windows 8. 
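
Under the covers, selecting the toolset in the property pages just sets the PlatformToolset property in your project file. If you prefer to edit the .vcxproj directly, the relevant fragment looks something like this (the surrounding PropertyGroup contents are abbreviated):

    <PropertyGroup Label="Configuration">
      <PlatformToolset>v110_xp</PlatformToolset>
    </PropertyGroup>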

C++ Runtime Support

The static and dynamic link libraries for the CRT, ConCRT/PPL, STL, and MFC have been updated in-place to add runtime support for Windows XP and Windows Server 2003. Applications written in C++/CLI which target the .NET Framework 4.0 will also run on Windows XP and Windows Server 2003. For these operating systems, the supported versions are Windows XP SP3 for x86, Windows XP SP2 for x64, and Windows Server 2003 SP2 for both x86 and x64.

    Library      v110 (Vista+)   v110 (Store Apps)   v110_xp (XP/2k3+)
    ----------   -------------   -----------------   -----------------
    CRT                X                 X                   X
    ConCRT/PPL         X                 X                   X
    STL                X                 X                   X
    MFC                X                                     X
    ATL                X                 X                   X
    C++ AMP            X                 X

Differences from Vista+ Targeting

  1. Building HLSL
    Building HLSL with the v110_xp toolset is not enabled by default. To enable HLSL compilation, download the DirectX SDK (June 2010) and set your project’s VC directories manually to point to this SDK, in a similar manner as Visual Studio 2010. For more information, see the “DirectX SDK Does Not Register Include/Library Paths with Visual Studio 2010” section of the DirectX SDK (June 2010) download page.
     
  2. Debugging DirectX
    The Visual Studio 2012 Graphics Debugging experience is not supported when targeting DirectX 9.
     
  3. Static Analysis
    When selecting the v110_xp platform toolset, the static analysis experience is disabled due to incompatibilities between the SAL annotations in the Visual Studio 2012 C++ libraries and the Windows 7 SDK. If static analysis is required, we recommend that you switch the solution to the normal v110 toolset, execute static analysis, and then switch back to v110_xp.
     
  4. Remote Debugging
    The Remote Tools for Visual Studio 2012 do not support remote debugging on an XP client. When debugging on Windows XP is required, it is recommended to use the debuggers of an older version of Visual Studio, such as Visual Studio 2010, for local or remote debugging. This is in line with the Windows Vista experience for Visual Studio 2012 RTM, which is a runtime target but not a remote debugging target.

Known issues with Visual Studio 2012 Update 1 CTP 3

There are two known issues with the CTP 3 release. These will be fixed in the next release of Visual Studio 2012 Update 1:

  1. The Visual C++ 2012 Redistributables in “Microsoft Visual Studio 11.0\VC\redist\1033” have not been updated to include Windows XP support. For this preview release, please use static linking when targeting Windows XP or deploy the C++ runtime DLLs from “Microsoft Visual Studio 11.0\VC\redist\<architecture>” inside your executable’s installation folder.
  2. Registering in-process ATL components currently fails on Windows XP and Windows Server 2003. A workaround is to use registration-free COM.

Feedback

As always, we'd love to hear your feedback. Please submit bugs to Visual Studio Connect, and suggestions to Visual Studio UserVoice.

 

Project Austin Part 4 of 6: C++ AMP acceleration


Hello, I am Amit Agarwal, a developer on the C++ AMP team. C++ AMP is a new technology available in Visual Studio 2012 that enables C++ developers to make the best use of available heterogeneous computing resources in their applications from within the same C++ sources and the VS IDE they use for programming the CPU. Austin is a digital note-taking app for Windows 8 and the visually engaging 3D effects associated with page turning in the Austin app are powered by the use of C++ AMP.

A page surface is modeled as a 3D mesh comprised of a collection of triangles each defined by the location of its vertices in 3 dimensions. The page turning animation involves a compute-intensive page curling algorithm comprised of two main steps:

  1. Deformation of the page surface mesh, used to calculate vertex positions for each frame.
  2. Calculating the vertex normals, subsequently used for applying shading to the page surface.

Both these steps are highly data parallel in nature and can be accelerated using C++ AMP to utilize the floating point arithmetic prowess of modern GPUs, hence improving the overall frame rate of the page turning animation. The page deformation step is currently implemented on the CPU; efforts to accelerate this step using C++ AMP are underway and we will talk about it in a future post.

In this blog post we will talk about accelerating the calculation of vertex normals using C++ AMP which is already part of the current version of Austin. But before we dive into the details, here is a picture depicting the page turning animation in Austin which is accelerated using C++ AMP.

[Image: the C++ AMP-accelerated page curl animation in Austin]

Introduction

Vertex normals are typically calculated as the normalized average of the surface normals of all triangles containing the vertex. Using this approach, computing the vertex normals on the CPU simply involves iterating over all triangles depicting the page surface and accumulating the triangle normals in the normals of the respective vertices.

In pseudo code:

for each triangle
{
    Position vertex1Pos = triangle.vertex1.position;
    Position vertex2Pos = triangle.vertex2.position;
    Position vertex3Pos = triangle.vertex3.position;
 
    Normal triangleNormal = cross(vertex2Pos – vertex1Pos, vertex3Pos – vertex1Pos);
 
    triangleNormal.normalize();
 
    vertex1.normal += triangleNormal;
    vertex2.normal += triangleNormal;
    vertex3.normal += triangleNormal;
}

Accelerating vertex normals calculation using C++ AMP

As mentioned earlier, calculation of vertex normals is highly amenable to C++ AMP acceleration owing to its data parallel and compute intensive nature.

A simple starting point would be to replace the “for each triangle” loop in the CPU implementation with a C++ AMP parallel_for_each call. The compute domain of the parallel_for_each call is the number of triangles depicting the page, specified as an extent argument. In simple terms this can be thought of as launching as many threads on the accelerator as the number of triangles (typically several thousands) with each thread responsible for computing the surface normal for a triangle and accumulating the value in the normals of the triangle’s vertices. However a vertex is part of multiple triangles and since the parallel_for_each threads execute concurrently, multiple threads can potentially attempt to accumulate their respective triangle normals to the same vertex resulting in a race. One way to address this would be to synchronize the accumulation of each vertex’s normal by using C++ AMP atomic operations. Unfortunately, atomic operations are expensive on GPU accelerators and would be severely detrimental to the kernel’s performance.

A better alternative approach is to break the calculation of vertex normals into two steps:

  1. Calculate the normal for each triangle.
  2. For each vertex accumulate the normals from all triangles that the vertex is a part of and update the vertex normals after normalizing the accumulated value.

This approach comprises two parallel_for_each invocations. The first one launches as many GPU accelerator threads as there are triangles, with each thread computing the normal of a triangle from the positions of the triangle’s vertices. The triangle normal values are stored in a temporary intermediate concurrency::array_view which is subsequently used in the second stage for accumulating each vertex’s normal.

In pseudo code:

parallel_for_each(extent<1>(triangleCount), [=](index<1> idx) restrict(amp) 
{
    Position vertex1Pos = triangle.vertex1.position;
    Position vertex2Pos = triangle.vertex2.position;
    Position vertex3Pos = triangle.vertex3.position;
 
    Normal triangleNormal = cross(vertex2Pos – vertex1Pos, vertex3Pos – vertex1Pos);
 
    triangleNormal.normalize();
    
    tempTriangleNormals[idx] = triangleNormal;
});

The second parallel_for_each launches as many threads as the number of vertices on the page, with each thread accumulating the normals of triangles that the vertex is a part of, from the temporary array_view used to store the triangle normal in the first parallel_for_each. Thereafter, the accumulated vertex normal is normalized and stored in the vertex normal array_view.

In pseudo code:

parallel_for_each(extent<2>(vertexCountY, vertexCountX), [=](index<2> idx) restrict(amp)
{
    // First get the existing vertex normal value
    Normal vertexNormal = vertexNormalView(idx);
 
    // Each vertex is part of 4 quads, with each quad comprising 2 triangles.
    // Based on the vertex position, we determine which triangles the vertex is
    // part of and whether each triangle's normal should be accumulated in the vertex normal.
    if (isVertexOfQuad1Triangle1) {
        vertexNormal += tempTriangleNormals(quad1Triangle1_index);
    }
 
    if (isVertexOfQuad1Triangle2) {
        vertexNormal += tempTriangleNormals(quad1Triangle2_index);
    }
 
    if (isVertexOfQuad2Triangle1) {
        vertexNormal += tempTriangleNormals(quad2Triangle1_index);
    }
 
    if (isVertexOfQuad2Triangle2) {
        vertexNormal += tempTriangleNormals(quad2Triangle2_index);
    }
 
    ...
 
    vertexNormal.normalize();
 
    vertexNormalView(idx) = vertexNormal;
});

Finally, the normal components of each vertex in the DirectX vertex buffer are updated by reading out the contents of the vertex normal array_view on the CPU. The DirectX vertex buffer is now ready for rendering, with the vertex normal values used for shading the page.

The source code for Austin is freely available for download on CodePlex. The bits specific to C++ AMP acceleration of vertex normals computation are located in the class paper_sheet_node in the source file paper_sheet_node.hpp – the core C++ AMP acceleration code is in the function updateNormalsAmp and some C++ AMP specific initialization code is contained in the function ensureAmpInitialized.

C++ AMP performance considerations

Having looked at the high-level approach to accelerating the vertex normal computation using C++ AMP, let us now dive deeper into details of the C++ AMP implementation that are important from a performance perspective.

Struct of Arrays

Firstly, let us talk about the layout of the input and output data accessed in the C++ AMP parallel_for_each kernels. The input of the first parallel_for_each invocation is an array_view of vertex positions, each position comprising three single-precision floating-point values (the x, y, and z components). The output is an array_view of triangle normals, where each normal again comprises three floating-point values. The input and output of the second parallel_for_each kernel are both array_views of normals.

The position and normal data is stored on the CPU as an array of structs. However, GPU accelerator memory yields optimal bandwidth when consecutive threads access consecutive memory locations – an access pattern commonly referred to as memory coalescing in GPU computing parlance. Hence, to ensure optimal memory access behavior, the layout of position and normal data on the GPU is adapted to be in the form of three arrays which hold the x, y and z components (of the vertex position or normal) respectively. Note that this is different from the CPU, where the data is laid out in memory as an array of structs, each struct comprising three floating-point values.
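
To make the two layouts concrete, here is a minimal sketch; the type names are illustrative, not Austin's actual types:

    #include <vector>

    // Array of structs: the CPU-side layout; x, y, z interleaved per vertex.
    struct float3 { float x, y, z; };
    std::vector<float3> positionsAoS;

    // Struct of arrays: the GPU-side layout. Thread i reads x[i] while its
    // neighbor reads x[i+1], so accesses to each component array coalesce.
    struct PositionsSoA
    {
        std::vector<float> x;
        std::vector<float> y;
        std::vector<float> z;
    };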

Persisting data in accelerator memory

The vertex normal values calculated in each frame are used in calculating the vertex normal values for the subsequent frame in the second parallel_for_each kernel. Consequently, it is beneficial to persist the vertex normal data in accelerator memory to be used for the vertex normal calculations in the subsequent frame instead of transferring the data from CPU memory in each frame.

Using staging arrays for transferring data between CPU and accelerator memory

The vertex position data is transferred from CPU to accelerator memory in each frame. Also, after the vertex normals are computed on the GPU accelerator, they are transferred back to the vertex buffer in CPU memory to be used for shading. Additionally, as noted earlier, the vertex position and normal data is laid out in accelerator memory as a struct of arrays instead of the array-of-structs layout used in CPU memory. For optimal data transfer performance between CPU and accelerator memory we employ staging arrays, which are used to stage the change in data layout in CPU memory. For example, the vertex positions are copied from the vertex buffer to a staging array in struct-of-arrays form and are subsequently copied to the accelerator. Similarly, the vertex normal data that is laid out as a struct of arrays in GPU memory is copied out to a staging array and is subsequently copied to the vertex buffer on the CPU in array-of-structs form.
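
Here is a small sketch of how a staging array is created in C++ AMP (the names are illustrative; see updateNormalsAmp for Austin's real code). An array constructed against the CPU accelerator_view and associated with a GPU accelerator_view is a staging array: it lives in CPU memory that the runtime can transfer to the accelerator efficiently.

    #include <amp.h>
    using namespace concurrency;

    void CreateStagingExample(int vertexCount)
    {
        accelerator cpu(accelerator::cpu_accelerator);
        accelerator gpu; // the default accelerator

        // Staging array: allocated on the CPU view, associated with the GPU view.
        array<float, 1> stagingX(extent<1>(vertexCount),
                                 cpu.default_view, gpu.default_view);

        // ... fill stagingX with the x components in struct-of-arrays order ...

        array<float, 1> gpuX(extent<1>(vertexCount), gpu.default_view);
        copy(stagingX, gpuX); // fast CPU-to-accelerator transfer
    }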

Future improvements

A careful look at the two parallel_for_each kernels comprising the vertex normal calculation code using C++ AMP, reveals that both these kernels exhibit 2-D spatial locality of data accesses. For example, in the first parallel_for_each kernel, each thread loads the vertex position data for the vertices of its triangle and since neighboring triangles have common vertices, adjacent threads read the same vertex position data independently from accelerator global memory. Similarly, in the second parallel_for_each kernel, each vertex loads the triangle normal values of the triangles it is part of and since adjacent vertices are part of the same triangles, the same triangle normal values are independently read by adjacent threads from accelerator global memory.

The accelerator global memory has limited bandwidth and note that both C++ AMP kernels here are likely to be memory bound as described in this post on C++ AMP performance guidance. Consequently having multiple threads read the same data multiple times from accelerator global memory is wasteful. While GPU accelerators are designed to hide global memory access latency through heavy multithreading and fast switching between threads, it is generally advisable to optimize global memory accesses (for optimal global memory bandwidth utilization) by employing opportunities of data reuse between adjacent threads through fast on-chip tile_static accelerator memory. While the current implementation does not employ this technique, it is worth experimenting with the use of tile_static memory in this implementation -- something we intend to do in the future.

The C++ AMP texture types are another form of global accelerator memory that is typically backed by caches designed for 2-D spatial locality and may be another alternative to using tile_static memory for leveraging the spatial 2-D data locality inherent in the C++ AMP acceleration kernels.

In closing

In this post we looked at the approach of accelerating one of the compute-intensive parts of the Austin application, viz. vertex normal calculations, using C++ AMP. While the actual gains obtained from C++ AMP acceleration depend on the available GPU hardware, when appropriately employed it may yield orders-of-magnitude improvements over CPU performance for compute-intensive kernels in your applications. Also, in the absence of DirectX 11 capable GPU hardware, C++ AMP employs a CPU fallback which uses your CPU’s multiple cores and SSE capabilities to accelerate the execution of your kernels. You can learn more about C++ AMP on the MSDN blog for Parallel Programming in Native Code.

We would love to hear your thoughts, comments, questions and feedback below or on the Parallel Programming in Native Code MSDN forum. 

Check out the new C++ AMP book by Kate Gregory and Ade Miller


The C++ AMP book by Kate Gregory and Ade Miller is available in print or online from your favorite retailer! What is in it for you? Among other things, you will discover how to:

  • Gain greater code performance using graphics processing units (GPUs)
  • Choose accelerators that enable you to write code for GPUs
  • Apply thread tiles, tile barriers, and tile static memory
  • Debug C++ AMP code with Microsoft Visual Studio®
  • Use profiling tools to track the performance of your code     

I was able to corner Ade Miller long enough to answer a few questions:

Q1: “C++ AMP” in 140 characters or less?

“A hardware agnostic data parallel programming model for C++.”

That’s pretty terse :). The C++ bit is important: C++ AMP is C++ not C.

Q2. How did you get interested in writing about C++ AMP?

I’ve had a longstanding interest in GPU programming. I got into it when I was writing a book about the C++ Parallel Patterns Library (PPL) that shipped with Visual Studio 2010. Like a lot of other people I was amazed by the potential for huge performance gains. At the time CUDA was the obvious choice for GPU programming so I wrote a fair bit of CUDA code. At the same time I was actually lobbying the PPL team and telling them they should go after GPUs too. I was really happy to see how C++ AMP was turning out and to get involved with the book.

Q3. What kind of research did you do for this book?

I’d been writing code for CUDA and Thrust so I was already pretty familiar with the data parallel model. I talked to the product team a lot. They were a fantastic resource and I’m really grateful to them for giving me the time. I wrote or rewrote a lot of code. I didn’t write all the book samples from scratch but I rewrote them to make sure that they not only used C++ AMP in the best possible way but also embraced the whole modern C++ style. I also ended up reading some really old papers from the 80s, stuff like "Data Parallel Algorithms" by Hillis & Steele. A few of the first big supercomputers, like the Connection Machine, were data parallel. Those guys did a lot of great thinking around how to program in a data parallel way. In that way GPUs are the old new thing.

Q4. What was the hardest part about writing C++ AMP?

I think it’s fair to say that was the work Kate and I did on performance (chapters 7-9). C++ AMP provides an abstraction over the GPU which is great for portability but occasionally it can lead to unexpected results that require more thought to figure out. We were also working with pre-release bits. The product team were working hard to improve the performance so our numbers would change between drops and on different hardware. On the plus side I learnt a whole lot more about the inner workings of GPUs.

Q5. What are you working on now?

Some extra material for the book; some fun stuff and some stuff I think we would have put in if we had the time. It’ll start appearing in the next few weeks. You can find out about it on my blog. I’m going to be doing some more coding with C++ and C++ AMP and I have another book project in mind.

Q6. What advice would you give new C++ developers?

Don’t worry too much about some of the darker corners of the language. It’s really easy to get caught up in some of the more esoteric features of C++ like performance tweaks and template (meta-)programming. There’s good stuff there for sure, but you’d be better off getting really comfortable with the Standard Library and the new language features of C++11. If you write modern C++ using algorithms, iterators and containers and use the RAII pattern, a lot of the things new C++ programmers struggle with pretty much disappear.

Q7. Do you have any favorite C++ authors or books?

Hands down Scott Meyers. Right now I’m reading his “An Overview of the New C++ (C++11)”. It was reading his other books and listening to some of Herb Sutter’s talks that got me interested in C++ again after a long vacation in C#.

Q8. What question should I have asked?

What’s it like to be a famous author?

Q9. And the answer?

You’d have to ask J.K. Rowling! Although someone did recognize me on the subway in Montreal a while back, which was quite a surreal experience.

Thanks Ade!

Visit the C++ AMP Book landing page for updates, retailers, and pointers to additional resources.

Should I have asked other questions? Let me know in the comments. Thanks! 

More Q&A for the C++ AMP book


As we shared last week, Kate Gregory and Ade Miller have released C++ AMP, an exploration of accelerated massive parallelism with Microsoft® Visual C++®. In that post, Ade answered a few of our hard-hitting questions. In this post, Kate answers the same set of questions:

Q1.    “C++ AMP” in 140 characters or less?

C++ apps can speed up the data parallel parts by 100x – imagine what that makes possible! And it’s C++ all the way.

Q2.    How did you get interested in writing about C++ AMP?

I’ve been watching parallelism and concurrency since PDC 2005 when both Herb Sutter and Jan Gray let us know the future would be concurrent and the free lunch was over. The minute C++ AMP was first shown publicly I knew I wanted to know more about it – which ended up leading to the book almost against my will. Writing books is hard!

Q3.    What kind of research did you do for this book?

I had to learn all about a technology before it was even finished! I spent a lot of time watching code run, changing things and trying it again, and meeting with the team about it. There was some background reading before I got started so that I understood where C++ AMP fit into the larger picture, as well. And I watched what Ade, my co-author, did to algorithms to make them faster and faster and faster. That was an education on its own.

Q4.    What was the hardest part about writing C++ AMP?

Writing about a moving target is always hard. We would finish an explanation or a sample, and then a new release would invalidate it and we needed to go back and start again. Or I would want an answer about how something worked and there was nothing anywhere about it yet. The good news there was that I could ask the team and get an authoritative answer remarkably quickly, but it was certainly a different process than some I’ve followed writing on other topics.

Q5.    What are you working on now?

I’ve just released a new course on using Visual Studio 2012, and I have a second part, on debugging and extensions, to complete. Then I’d like to start looking at the library situation in C++.

Q6.    What advice would you give new C++ developers?

Do not read old books, watch old webcasts, or listen to old developers (like me!) unless they’ve converted to modern C++. You can get all worked up about char* strings and manual memory management and believe this language is super complicated and hard. Or you can use what the Standard Library has to offer, like std::string or the new smart pointers that actually are smart, and you’ll find it easy, readable, and faster than anything else.

Q7.    Do you have any favorite C++ authors or books?

Scott Meyers, Herb Sutter, and Andrei Alexandrescu are a can’t-miss combination. I’ll listen to any one of them, or read what they’ve written, and never regret it, though sometimes Andrei makes my head hurt from thinking so hard.

Q8. What question should I have asked?

Why have I stayed with C++ for so long when so many people moved to .NET?

Q9. And the answer?

I love the speed, the power, and the control that C++ offers. Over the last decade I’ve done a lot in managed code, as many people have, but I never stopped using C++ or feeling that it was special. Now that C++ is getting a lot of attention again, partly because of amazing technology like C++ AMP, I have to tell my friends “don’t call it a comeback – some of us never left!”

Thanks Kate!

Do you have questions or other feedback? Leave a comment!

Project Austin Part 5 of 6: Shadow Rendering


When we designed the user experience of Austin, we spent quite some time thinking about the different page views and layouts, and how to transition between them.  We wanted to create an immersive experience where the user can manipulate and navigate Austin's pages in an intuitive way; pages zoom in and out in 3D with pinch gestures, and the camera glides over the pages and even tilts a little bit when a finger is dragged on the screen.  This makes the pages simply appear to "float" in 3D space and move around and re-arrange themselves as needed.  To add some visual eye-candy, I spent some time having the pages cast shadows.  This subtle addition makes the scene visually much richer and way more interesting.

I didn't really have a lot of experience in graphics programming, so this was somewhat of a challenge for me.  I spent some time looking into very well-known algorithms, such as shadow mapping and volume shadows.  I prototyped a version using shadow maps but couldn't quite make it look the way I wanted.  In the end, I figured that I could probably fake most of it since I have the advantage that the geometry that would cast shadows in my scene is incredibly simple (just a flat, square piece of paper!), so I ended up with the technique I describe here.  I am sure there're much better ways to do this, but I am very happy with the way the result looks.  Speed was also a goal of mine as my initial implementations were quite a drag on my frame rates, but this implementation turned out to be very fast.

The source code for Austin, including the bits specific to shadow rendering described here, can be downloaded here.

So how does it work?

The fact that our pages always lie flat over the XY plane is a key point to solving this problem.  This means we know the shadow that's going to be cast on that plane is always also a rectangle.  There's one small exception: when we curl a page as the user drags a finger across to move to the next page, we just render a thinner, yet still rectangular, shadow depending on how curled the page is.  It's not a perfect projection of the deformed page but it's good enough to do the trick.

With this in mind, we can draw a shadow by rendering a simple quad that's in the same position as the page but moved backwards in the Z axis so it sits flat on the XY plane, and scaled and offset a bit as well to "fake" the projection of the shadow volume cast on the XY plane.  This quad is rendered in a darker color than the color of the background, which is always a solid color.

Drawing a solid quad would create a very "hard" shadow, so we want to soften it a bit.  To do that we apply a Gaussian blur filter, which is implemented as a pixel shader on the GPU.  This is a very well documented filter and there are plenty of articles online on how to do it. To see what the effect looks like, check out Gaussian blur effect on MSDN for an example.
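
The filter weights themselves are easy to compute. As a minimal sketch (this is not Austin's actual shader code; radius and sigma are illustrative parameters), one might compute a normalized 1-D kernel on the CPU and hand it to the pixel shader as constants:

    #include <cmath>
    #include <vector>

    std::vector<float> MakeGaussianKernel(int radius, float sigma)
    {
        std::vector<float> weights(2 * radius + 1);
        float sum = 0.0f;
        for (int i = -radius; i <= radius; ++i)
        {
            float w = std::exp(-(i * i) / (2.0f * sigma * sigma));
            weights[i + radius] = w;
            sum += w;
        }
        for (auto& w : weights) // normalize so the blur preserves brightness
            w /= sum;
        return weights;
    }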

The second thing we do that helps blur and soften the shadows is rendering this scene to a much smaller render target than the screen, in our case a 256x256 texture.  Then when we copy the final texture to the back buffer, we blow it up to the screen dimensions, and this operation softens the shadows further due to the filtering applied to the texture to go from 256 pixels to the actual resolution.  And as a bonus side effect, the 256x256 texture is much smaller and much faster to render to.

Finally, to add a bit more depth, we do one last post-processing effect and do a "radial darkening" such that the picture gets darker the further you go from the center of the background.

You can get the code for all of this in the Austin project, look at notebook::renderFrame for the main chunk of code performing the different steps, and in the quad_pixel_shader_*.hlsl files for the pixel shaders for the different effects.  page::renderShadow contains the code that shows how the shadow node is scaled and translated to create the fake projection.

In pictures

An image is worth a thousand words, so let's see how all this looks in the app.

First, this is what the darker quads look like rendered on the flat background, on the 256x256 texture:

This is how it looks with the Gaussian blur applied:

And then, with the radial darken filter:

Finally, that's blown up to the screen size and the paper pages are rendered on top of it:

This next picture shows how the same scene looks without the shadows, with just the flat background, and what a difference it all makes:

The following picture shows the difference between rendering directly to a render target the size of the screen and to a smaller (256x256) render target that then gets enlarged to the size of the screen:

In the end, this technique ended up looking very good and more importantly, giving me the performance I wanted.  I had a great time doing this bit of work and I hope these ideas can help someone else trying to solve the same problem. 

Can't get enough? Check out Cascaded Shadow Maps and Common Techniques to Improve Shadow Depth Maps for more details on achieving high-performance 3D shadows.


C++/CX Part 4 of [n]: Static Member Functions


See C++/CX Part 0 of [n]: An Introduction for an introduction to this series and a table of contents with links to each article in the series.

In this article, we'll take a look at static member functions and how they are supported by the Windows Runtime. A Windows Runtime reference type (also called a ref class in C++/CX, or a runtime class) can have static member functions. In C++/CX, the syntax used to declare a static member function in a runtime class is exactly the same as the syntax used in an ordinary C++ class. To demonstrate this, here is a runtime class with one static member function:

    public ref class KnownValues sealed
    {
    public:
        static int GetZero() { return 0; }
    private:
        KnownValues(); // This type can't be constructed
    };

(Note that we have declared a private default constructor to ensure that it is not possible to create an instance of this class. If we define a ref class and don't declare any constructors, the compiler will provide a public default constructor for the type, just as it would for an ordinary C++ class. It's possible to define a type that is constructible and has static members; we've just made this type non-constructible to make the next examples a bit simpler.)

Similarly, the syntax used to call a static member function declared by a runtime class is exactly the same as the ordinary C++ syntax. Here's how we'd call GetZero:

    int x = KnownValues::GetZero(); 

So, at least syntactically, there's nothing special about static member functions in C++/CX. However, the mechanism via which static member functions are supported by the Windows Runtime deserves some comment.

Implementation of the Static Member Function

A call to a static member function is made independent of any instance of the class that declares that function. A static member function has no this pointer. We don't need to create a KnownValues object in order to call its GetZero static member function. In order to allow a runtime class to have static member functions, we need some sort of method that allows us to call a function without first creating an instance of its declaring type.

It turns out that we've already solved this problem, in Part 3: Under Construction, when we implemented constructors using an activation factory. To summarize that article, we implemented support for constructors by:

  1. converting each constructor into a function that returns a new instance of the type,
  2. defining an interface, called a factory interface, that declares all of those construction functions,
  3. defining a runtime class, called an activation factory, that implements the factory interface, and
  4. providing a well-defined way to get an instance of the activation factory for an arbitrary type.

An activation factory allows us to implement functions associated with a runtime class that can be called without first creating an instance of that runtime class. A particular runtime class can only have one activation factory associated with it, but that activation factory can implement multiple interfaces. In addition to implementing zero or more factory interfaces, which declare construction functions, an activation factory can also implement zero or more static interfaces, which declare static member functions.

We'll re-implement the KnownValues type using C++ and WRL, but we won't go into too much detail; the previous article covers activation factories in depth and there aren't many differences here. First, here are the IDL declarations for the runtime class and its static interface, which are quite straightforward:

    [exclusiveto(KnownValues)]
    [uuid(ca8c9b14-f2a3-4f1e-aa50-49bfa3a5dbd3)]
    [version(1.0)]
    interface IKnownValuesStatics : IInspectable
    {
        HRESULT GetZero([out] [retval] int* value);
    }
    [static(IKnownValuesStatics, 1.0)]
    [version(1.0)]
    runtimeclass KnownValues
    {
    }

The static attribute on KnownValues specifies that the IKnownValuesStatics interface is a static interface for the KnownValues runtime class. Note that the KnownValues type does not declare that it implements any instance interfaces (i.e., its body is empty). This is because no instance of the KnownValues runtime class will ever be created. This runtime class is really just a container used to define static member functions (in C# terminology, this would be called a static class).

The activation factory implementation is also straightforward:

    class KnownValuesFactory : public ActivationFactory<IKnownValuesStatics>
    {
        InspectableClassStatic(RuntimeClass_WRLKnownValuesComponent_KnownValues, BaseTrust)
    public:
 
        STDMETHODIMP GetZero(int* value) override
        {
            *value = 0;
            return S_OK;
        }
    };
    ActivatableStaticOnlyFactory(KnownValuesFactory)

Note that because we will never create an instance of KnownValues, we don't actually need to define a KnownValues class in C++. We only need to define the activation factory, which implements the IKnownValuesStatics static interface.

All activation factories must also implement the IActivationFactory interface. The ActivationFactory base class template that we use provides a default implementation of this interface, which does the right thing for a non-activatable type. A particular runtime class may both be activatable and have static member functions. In that case, its activation factory would implement both a factory interface and a static interface.
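
In WRL terms, such a combined factory simply lists both interfaces as arguments to the ActivationFactory base class template. A hypothetical outline (IWidgetStatics is an invented static interface, and the member implementations are elided):

    class WidgetFactory : public ActivationFactory<IWidgetFactory, IWidgetStatics>
    {
        // Implements ActivateInstance (from IActivationFactory), the
        // CreateInstance factory function, and the static member functions.
    };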

Calling the Static Member Function

Since static member functions are implemented in the same way as constructors, it should come as no surprise that the process of calling a static member function is exactly the same as the process of calling a constructor. Two steps are required: first, we need to get the activation factory for the type, then we can call the function. The WRL code to invoke GetZero is as follows:

    HStringReference classId(RuntimeClass_WRLKnownValuesComponent_KnownValues);
    
    ComPtr<IKnownValuesStatics> statics;
    RoGetActivationFactory(
        classId.Get(),
        __uuidof(IKnownValuesStatics),
        reinterpret_cast<void**>(statics.GetAddressOf()));
    
    int x = 0;
    statics->GetZero(&x);

Aside from the error handling which has been omitted for brevity, this code is equivalent to the C++/CX invocation of GetZero from above:

    int x = KnownValues::GetZero(); 

Hello ARM: Exploring Undefined, Unspecified, and Implementation-defined Behavior in C++


With the introduction of Windows RT for ARM devices, many Windows software developers will be encountering ARM processors for the first time. For the native C++ developer this means the potential for running afoul of undefined, unspecified, or implementation-defined behavior--as defined by the C++ language--that is expressed differently on the ARM architecture than on the x86 or x64 architectures that most Windows developers are familiar with.

In C++, the results of certain operations, under certain circumstances, are intentionally ambiguous according to the language specification. In the specification this is known as "undefined behavior" and, simply put, means "anything can happen, or nothing at all. You're on your own". Having this kind of ambiguity in a programming language might seem undesirable at first, but it's actually what allows C++ to be supported on so many different hardware platforms without sacrificing performance. By not requiring a specific behavior in these circumstances, compiler vendors are free to do whatever makes reasonable sense. Usually that means whatever is easiest for the compiler or most efficient for the target hardware platform. However, because undefined behavior can't be anticipated by the language, it's an error for a program to rely on the expressed behavior of a particular platform. As far as the C++ specification is concerned, it's perfectly standards-conforming for the behavior to be expressed differently on another processor architecture, to change between microprocessor generations, to be affected by compiler optimization settings, or for any other reason at all.

Another category of behavior, known as "implementation-defined behavior", is similar to undefined behavior in that the C++ language specification doesn't prescribe a particular expression, but is different in that the specification requires the compiler vendor to define and document how their implementation behaves. This gives the vendor freedom to implement the behavior as they see fit, while also giving users of their implementation a guarantee that the behavior can be relied upon, even if the behavior might be non-portable or can change in the next version of the vendor's compiler.

Finally, there is "unspecified behavior" which is just implementation-defined behavior without the requirement for the compiler vendor to document it.

All of this undefined, unspecified, and implementation-defined behavior can make porting code from one platform to another a tricky proposition. Even writing new code for an unfamiliar platform might seem at times like you've been dropped into a parallel universe that's just slightly askew from the one you know. Some developers might routinely cite the C++ language specification from memory, but for the rest of us the border between portable C++ and undefined, unspecified, or implementation-defined behavior might not always be at the forefront of our minds. It can be easy to write code that relies on undefined, unspecified or implementation-defined behavior without even realizing it.

To help you make a smooth transition to Windows RT and ARM development, we've compiled some of the most common ways that developers might encounter (or stumble into) undefined, unspecified, or implementation-defined behavior in "working" code--complete with examples of how the behavior is expressed on ARM, x86 and x64 platforms using the Visual C++ tool chain. The list below is by no means exhaustive, and although the specific behaviors cited in these examples can be demonstrated on particular platforms, the behaviors themselves should not be relied upon in your own code. We include the observed behaviors only because this information might help you recognize how your own code might rely on them.

Floating point to integer conversions

On the ARM architecture, when a floating-point value is converted to a 32-bit integer, it saturates; that is, it converts to the nearest value that the integer can represent, if the floating-point value is outside the range of the integer. For example, when converting to an unsigned integer, a negative floating-point value always converts to 0, and to 4294967295 if the floating-point value is too large for an unsigned integer to represent. When converting to a signed integer, the floating-point value is converted to -2147483648 if it is too small for a signed integer to represent, or to 2147483647 if it is too large. On x86 and x64 architectures, floating-point conversion does not saturate; instead, the conversion wraps around if the target type is unsigned, or is set to -2147483648 if the target type is signed.

The differences are even more pronounced for integer types smaller than 32 bits. None of the architectures discussed have direct support for converting a floating-point value to integer types smaller than 32 bits, so the conversion is performed as if the target type is 32 bits wide, and then truncates to the correct number of bits. Here you can see the result of converting +/- 5 billion (5e009) to various signed and unsigned integer types on each platform:

Results of converting +5e+009 to different signed and unsigned integer sizes:

                ARM          ARM      ARM     x86/x64      x86/x64   x86/x64
                32-bit       16-bit   8-bit   32-bit       16-bit    8-bit
    unsigned    4294967295   65535    255     705032704    0         0
    signed      +2147483647  -1       -1      -2147483648  0         0

Results of converting -5e+009 to different signed and unsigned integer sizes:

                ARM          ARM      ARM     x86/x64      x86/x64   x86/x64
                32-bit       16-bit   8-bit   32-bit       16-bit    8-bit
    unsigned    0            0        0       3589934592   0         0
    signed      -2147483648  0        0       -2147483648  0         0

As you can see, there's no simple pattern to what's going on because saturation doesn't take place in all cases, and because truncation doesn't preserve the sign of a value.

Still other values introduce more arbitrary conversions. On ARM, when you convert a NaN (Not-a-Number) floating point value to an integer type, the result is 0x00000000. On x86 and x64, the result is 0x80000000.

The bottom line for floating point conversion is that you can't rely on a consistent result unless you know that the value fits within a range that the target integer type can represent.
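If you need one specific behavior on every platform, you can get it portably by clamping the value before the cast, because the conversion is well-defined whenever the value is in range. Here's a minimal sketch of that approach (the function name and the choice of 0 for NaN are ours, purely for illustration):

    #include <cstdint>
    #include <limits>

    // Clamp before converting so the cast is always in range, and therefore
    // well-defined on ARM, x86, and x64 alike. NaN is handled explicitly
    // because it compares unequal to everything, including itself.
    int32_t to_int32_saturated(double d)
    {
        if (d != d)                                                  // NaN
            return 0;
        if (d <= static_cast<double>(std::numeric_limits<int32_t>::min()))
            return std::numeric_limits<int32_t>::min();
        if (d >= static_cast<double>(std::numeric_limits<int32_t>::max()))
            return std::numeric_limits<int32_t>::max();
        return static_cast<int32_t>(d);                  // in range: portable
    }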

Shift operators

On the ARM architecture, the shift operators always behave as if they take place in a 256-bit pattern space, regardless of the operand size--that is, the pattern repeats, or "wraps around", only every 256 positions. Another way of thinking of this is that the pattern is shifted the specified number of positions modulo 256. Then, of course, the result contains just the least-significant bits of the pattern space.

On the x86 and x64 architectures, the behavior of the shift operator depends on both the size of the operand and on whether the code is compiled for x86 or x64. On both x86 and x64, operands that are 32 bits in size or smaller behave the same--the pattern space repeats every 32 positions. However, operands that are larger than 32 bits behave differently when compiled for the x86 and x64 architectures. Because the x64 architecture has an instruction for shifting 64-bit values, the compiler emits this instruction to carry out the shift; but the x86 architecture doesn't have a 64-bit shift instruction, so the compiler emits a small software routine to shift the 64-bit operand instead. The pattern space of this routine repeats every 256 positions. As a result, the x86 platform behaves less like its x64 sibling and more like ARM when shifting 64-bit operands.

Widths of the pattern spaces on each architecture:

    Variable size   ARM   x86   x64
    8               256   32    32
    16              256   32    32
    32              256   32    32
    64              256   256   64

Let's look at some examples. Notice that in the first table the x86 and x64 columns are identical, while in the second table it's the x86 and ARM columns that match.

Given a 32-bit integer with a value of 1:

    Shift amount   ARM     x86     x64
    0              1       1       1
    16             32768   32768   32768
    32             0       1       1
    48             0       32768   32768
    64             0       1       1
    96             0       1       1
    128            0       1       1
    256            1       1       1

 

Given a 64-bit integer with a value of 1:

    Shift amount   ARM          x86          x64
    0              1            1            1
    16             32768        32768        32768
    32             4294967296   4294967296   4294967296
    48             2^48         2^48         2^48
    64             0            0            1
    96             0            0            4294967296
    128            0            0            1
    256            1            1            1

To help you avoid this error, the compiler emits warning C4293 to let you know that your code uses shift amounts that are too large (or negative) to be safe--but only when the shift amount is a constant or literal value.
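The portable fix is simple: never let a shift amount reach the width of the operand. Here's a small sketch; the "shifting everything out yields zero" policy is our own explicit choice, not something the language or hardware promises:

    #include <cstdint>

    // Guard the shift amount so the operation always stays within defined
    // behavior; amounts of 64 or more map to an explicit result of 0.
    uint64_t shift_left_checked(uint64_t value, unsigned amount)
    {
        return (amount < 64) ? (value << amount) : 0;
    }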

'volatile' behavior

On the ARM architecture, the memory model is weakly-ordered. This means that a thread observes its own writes to memory in-order, but that writes to memory by other threads can be observed in any order unless additional measures are taken to synchronize the threads. The x86 and x64 architectures, on the other hand, have a strongly-ordered memory model. This means that a thread observes both its own memory writes, and the memory writes of other threads, in the order that the writes are made. In other words, a strongly-ordered architecture guarantees that if a thread, B, writes a value to location X, and then writes again to location Y, then another thread, A, will not see the update to Y before it sees the update to X. Weakly-ordered memory models make no such guarantee.

Where this intersects with the behavior of volatile variables is that, combined with the strongly-ordered memory model of x86 and x64, it was possible in the past to (ab)use volatile variables for certain kinds of inter-thread communication. This is the traditional semantics of the volatile keyword in Microsoft's compiler, and a lot of software exists that relies on those semantics to function. However, the C++11 language specification does not require that such memory accesses be strongly-ordered across threads, so it is an error to rely on this behavior in portable, standards-conforming code.

For this reason, the Microsoft C++ compiler now supports two different interpretations of the volatile storage qualifier that you can choose between by using a compiler switch. /volatile:iso selects the strict C++ standard volatile semantics that do not guarantee strong ordering. /volatile:ms selects the Microsoft extended volatile semantics that do guarantee strong ordering.

Because /volatile:iso implements the C++ standard volatile semantics and can open the door for greater optimization, it's a best practice to use /volatile:iso whenever possible, combined with explicit thread synchronization primitives where required. /volatile:ms is only necessary when the program depends upon the extended, strongly-ordered semantics.

Here's where things get interesting.

On the ARM architecture, the default is /volatile:iso because ARM software doesn't have a legacy of relying on the extended semantics. However, on the x86 and x64 architectures the default is /volatile:ms because a lot of the x86 and x64 software written with Microsoft's compiler in the past relies on the extended semantics. Changing the default to /volatile:iso for x86 and x64 would silently break that software in subtle and unexpected ways.

Still, it's sometimes convenient or even necessary to compile ARM software using the /volatile:ms semantics--for example, it might be too costly to rewrite a program to use explicit synchronization primitives. But take note that in order to achieve the extended /volatile:ms semantics within the weakly-ordered memory model of the ARM architecture, the compiler has to insert explicit memory barriers into the program which can add significant runtime overhead.

Likewise, x86 and x64 code that doesn't rely on the extended semantics should be compiled with /volatile:iso in order to ensure greater portability and free the compiler to perform more aggressive optimization.
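As an illustration of what migrating away from the extended semantics can look like, here's a minimal sketch that uses the C++11 std::atomic facilities for the inter-thread handshake instead of a volatile flag. The default sequentially consistent operations give you the ordering guarantee on ARM, x86, and x64 alike:

    #include <atomic>

    std::atomic<bool> dataReady(false);
    int payload = 0;

    void producer()
    {
        payload = 42;           // ordinary write...
        dataReady.store(true);  // ...published by the atomic store
    }

    void consumer()
    {
        while (!dataReady.load())
            ;                   // spin (for illustration only)
        // The atomic load synchronizes with the store above, so payload
        // is guaranteed to be observed as 42 here, even on ARM.
    }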

Argument evaluation order

Code that relies on function call arguments being evaluated in a specific order is faulty on any architecture because the C++ standard says that the order in which function arguments are evaluated is unspecified. This means that, for a given function call F(A, B), it's impossible to know whether A or B will be evaluated first. In fact, even when targeting the same architecture with the same compiler, things like calling convention and optimization settings can influence the order of evaluation.

While the standard leaves this behavior unspecified, in practice, the evaluation order is determined by the compiler based on properties of the target architecture, calling convention, optimization settings, and other factors. When these factors remain stable, it's possible that code which inadvertently relies on a specific evaluation order can go unnoticed for quite some time. But migrate that same code to ARM, and you might shake things up enough to change the evaluation order, causing it to break.

Fortunately, many developers are already aware that argument evaluation order is unspecified and are careful not to rely on it. Even still, it can creep into code in some unintuitive places, such as member functions or overloaded operators. Both of these constructs are translated by the compiler into regular function calls, complete with unspecified evaluation order. Take the following code example:

    Foo foo;
    foo->bar(*p);

This looks well defined, but what if -> and * are actually overloaded operators? Then, this code expands to something like this:

    Foo::bar(operator->(foo), operator*(p));

Thus, if operator->(foo) and operator*(p) interact in some way, this code example might rely on a specific evaluation order, even though it would appear at first glance that bar() has only one argument.
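The defensive pattern is equally simple: when two argument expressions can interact, hoist them into named locals so that each evaluation becomes its own sequenced statement. A quick sketch:

    #include <cstdio>

    int next(int &counter) { return ++counter; }

    int main()
    {
        int counter = 0;

        // Risky: the two arguments may be evaluated in either order, so
        // this can print "1 2" or "2 1" depending on compiler, target
        // architecture, calling convention, and optimization settings.
        printf("%d %d\n", next(counter), next(counter));

        // Safe: named locals force one evaluation per full-expression,
        // so the order is fixed regardless of platform.
        int first  = next(counter);
        int second = next(counter);
        printf("%d %d\n", first, second);
    }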

Variable arguments

On the ARM architecture, all loads and stores are aligned. Even variables that are on the stack are subject to alignment. This is different than on x86 and x64, where there is no alignment requirement and variables pack tightly onto the stack. For local variables and regular parameters, the developer is well-insulated from this detail by the type system. But for variadic functions--those that take a variable number of arguments--the additional arguments are effectively typeless, and the developer is no longer insulated from the details of alignment.

The code example below is a bug, regardless of platform. What makes it interesting in this discussion is that the way the x86 and x64 architectures express the behavior happens to make the code function as the developer probably intended for a subset of potential values, while the same code running on the ARM architecture always produces the wrong result. Here's an example using the cstdio function printf:

    // Note that a 64-bit integer is being passed to the function, but '%d' is used to read it.
    // On x86 and x64, this may work for small values, since %d will "parse" the lower 32 bits
    // of the argument. On ARM, the stack is padded to align the 64-bit value, and the code
    // below will print whatever value was previously stored in the padded position.
    printf("%d\n", 1LL);

In this case, the bug can be corrected by making sure that the correct format specification is used, which ensures that the alignment of the argument is considered. The following code is correct:

    // CORRECT: use %I64d for 64-bit integers
    printf("%I64d\n", 1LL);

Conclusion

Windows RT, powered by ARM processors, is an exciting new platform for Windows developers. Hopefully this blog post has exposed some of the subtle portability gotchas that might be lurking in your own code, and has made it easier for you to bring your code to Windows RT and the Windows Store.

Do you have questions, feedback, or perhaps your own portability tips? Leave a comment! 

C++ at BUILD 2012


Experience the Build conference on Channel 9 and learn how to build fierce Windows 8 apps using C++ from the experts:

  • Bringing Existing C++ Code to Windows Store Apps, Tarek Madkour, 10/30 11:45am. Beyond just learning how to write apps in C++, you will see how to create new libraries or reuse existing components that you can seamlessly combine with Metro style apps using JavaScript, C#, and other C++ apps.
  • The Power of C++ - Project Austin App, Ale Contenti, 10/30 2:15pm. The Visual C++ team leveraged DirectX, C++ AMP, PPL and WinRT to take advantage of the Windows 8 hardware (including stylus). Come and dive into the Austin codebase as we share tips and tricks we discovered along the way.
  • Connecting C++ Apps to the Cloud via Casablanca, Niklas Gustafsson and Artur Laksberg, 10/30 5:45pm. With Casablanca, C++ developers get modern APIs based on C++ 11 with support for accessing and authoring REST services, asynchronous I/O libraries to support writing scalable and responsive code, and more!
  • Developing a Windows Store app using C++ and DirectX, Phil Napieralski, 10/30 5:45pm. This session will show you how to integrate Windows Store app features in your C++ and DirectX app. We'll cover things like creating an app bar, API support for monetization, handling live tiles and notifications, and registering Process Lifetime Management (PLM, or suspend and resume) events.
  • It’s all about performance: Using Visual C++ 2012 to make the best use of your hardware, Don McCrady and Jim Radigan, 10/31 11:15am. This is a “go-fast” talk that’s also a great intro into computer architecture and C++ compilers.
  • DirectX Graphics Development with Visual Studio 2012, Rong Lu, 10/31 1:45pm. Whether you are just getting started with 2D/3D games or you’ve been slinging vectors and models for years, there's something for you in this talk. Learn more about the new tools integrated into Visual Studio 2012 that can help you visualize graphics assets, author shaders easily, diagnose graphics issues, and more.
  • Diving deep into C++ /CX and WinRT, 10/31 5:15pm.  This talk is for you if you want to understand all the nitty-gritties of the language, semantics and best practices when creating your own WinRT component for reuse in your Windows Store Apps.
  • Performance Tips for Windows Store apps using DirectX and C++, 11/1 8:30am. Learn how to squeeze out every drop of performance and preserve battery life with your code by avoiding typical performance pitfalls.
  • Tips for building a Windows Store app using XAML and C++: The Hilo project, 11/1 10:15am. 18 tips based on what we learned when building Hilo, a world-ready Windows Store app built using modern C++ and Windows XAML.
  • The Future of C++, Herb Sutter, 11/2 12:45pm. This talk will give an update on recent progress and near-future directions for C++, both at Microsoft and across the industry, with some announcements of interest in both areas. The speaker is the lead language architect of Visual C++ and chair of the ISO C++ committee.

If you are at Build, try to catch a few of these sessions. If you were unable to make Build, watch Channel 9 for session videos. On the social front, like Visual C++ (Visual CPP) on Facebook and follow Visual C++ on Twitter!

As always, we look forward to your comments, suggestions and other feedback!

Project Austin Part 6 of 6: Storage


Hi, my name is George Mileka. I’m a developer on the C++ Libraries team. I have been working on Project Code Name Austin for many months with Jorge, Eric, and Alan. To learn more about Project Austin, you can read this great post by Jorge Pereira.

For Project Austin, we have used ESE (Extensible Storage Engine) as the storage engine. In this blog post, I will explain why we chose ESE, how ESE works, and finally what abstractions we have created around it for our own use.

Why use ESE?

When we started thinking about how to store the pages, the strokes, and the photos in the notebook, we came to realize we need to support the following scenarios:

  • The application may be shut down unexpectedly due to Windows 8 lifetime management. Even though the application will get a chance to save its state in the Suspending event handler, that time cannot be guaranteed to be long enough to save all the necessary state, because it is not controlled by the application. As a result:
    • We need a way to guarantee data consistency even if the application was shutdown unexpectedly.
    • We also need a fast way to save the data to minimize data loss in case of an unexpected shutdown.
  • We expect the data to grow significantly over time, to the degree that we cannot save or load the entire data set within a reasonable amount of time. As a result:
    • We need the ability to be selective about what we load or save.
    • We need the ability to cache some of the reads or writes.

Taking these requirements into consideration, we realized that implementing our own layer would be a significant amount of work in terms of time and complexity.

At the same time, Jorge pointed out that we can use a database engine since it really does everything we are looking for.

So – it is going to be a database engine. But which database engine?

We needed a database engine that is accessible from Windows Store apps and has a relatively easy deployment story. For the initial release, we have decided to not support remote storage or sync’ing across devices to scope the work.

A few database engines came to mind:

  • ESE
    • ESE is a database technology that has been shipping in Windows for a few releases now. A process can access the functionality as it would access many other Windows Win32 functionalities: include the header (esent.h), make the Win32 calls, and link with the import library (esent.lib). ESE is already being used by various major Microsoft products.
  • SQL Server Compact Edition
    • SQL Server Compact Edition is similar to ESE, but additionally, it has a query processor. Unfortunately, it does not currently support Windows Store apps.
  • SQLite
    • SQLite is an open-source project that provides similar functionality to SQL Server Compact.

Talking to the ESE team, it turned out that they were committed to having their engine work for Windows Store applications. Also, the fact that ESE is already in the box (Windows maintains and services it) made it more attractive to us than SQLite. (We had a pre-determined set of queries to run, so the lack of a query processor was not a huge minus.) So, we decided to use ESE.

How ESE Works

In this section I will explain the high-level concepts in ESE. The most complete source for documentation is, of course, MSDN. The ESE interface is a flat C API. Whether you want to create tables, define indexes, insert data, update data, or run queries, you do it by filling in a data structure and passing it to the appropriate API.

Creating Tables

We create a table by passing an instance of JET_TABLECREATE4 to JetCreateTableColumnIndex4. JET_TABLECREATE4 describes the table and its columns and indexes. Columns are described by an array of JET_COLUMNCREATE instances, and indexes are described by an array of JET_INDEXCREATE3 instances.
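To make this concrete, here's a rough sketch of creating a small two-column table with one primary index. Error handling is omitted, optional fields are left zeroed, the table, column, and index names are placeholders of our own, and a Unicode build of the ESE headers is assumed:

    #include <esent.h>  // link with esent.lib

    void createPhotoTable(JET_SESID sesid, JET_DBID dbid)
    {
        // Describe the columns.
        JET_COLUMNCREATE columns[2] = {};
        columns[0].cbStruct     = sizeof(JET_COLUMNCREATE);
        columns[0].szColumnName = const_cast<WCHAR*>(L"ID");
        columns[0].coltyp       = JET_coltypLong;
        columns[1].cbStruct     = sizeof(JET_COLUMNCREATE);
        columns[1].szColumnName = const_cast<WCHAR*>(L"PageID");
        columns[1].coltyp       = JET_coltypLong;

        // Describe the index. The key string uses a '+'/'-' direction
        // prefix per column and is double-null-terminated.
        WCHAR key[] = L"+ID\0";
        JET_INDEXCREATE3 index = {};
        index.cbStruct    = sizeof(JET_INDEXCREATE3);
        index.szIndexName = const_cast<WCHAR*>(L"ID_Index");
        index.szKey       = key;
        index.cbKey       = sizeof(key) / sizeof(key[0]); // characters, incl. both nulls
        index.grbit       = JET_bitIndexPrimary;

        // Describe the table, then ask ESE to create everything at once.
        JET_TABLECREATE4 tableDef = {};
        tableDef.cbStruct       = sizeof(JET_TABLECREATE4);
        tableDef.szTableName    = const_cast<WCHAR*>(L"Photo");
        tableDef.rgcolumncreate = columns;
        tableDef.cColumns       = 2;
        tableDef.rgindexcreate  = &index;
        tableDef.cIndexes       = 1;

        JetCreateTableColumnIndex4(sesid, dbid, &tableDef);
    }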

Writing Data

Writing data (inserting or updating) is done by building an array of type JET_SETCOLUMN where each element points to the value to be written as well as the column the value should be written to. Then, we pass the array to the appropriate ESE APIs (see JetPrepareUpdate/JetSetColumns/JetUpdate2).

In case of updates, ESE uses a cursor to decide where to apply the new set of values. So, we need to position the cursor at the right record (or records), and then invoke the writing API.

Note that we can build an array of only the fields we want to target.

In case of inserts, positioning the cursor is not necessary since it is going to be a new record.
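Here's a hedged sketch of an insert (error handling omitted; the column id is assumed to have been captured when the table was created or opened):

    #include <esent.h>

    void insertRow(JET_SESID sesid, JET_TABLEID tableid,
                   JET_COLUMNID idColumn, long idValue)
    {
        // Start a new record; no cursor positioning is needed for inserts.
        JetPrepareUpdate(sesid, tableid, JET_prepInsert);

        // One JET_SETCOLUMN per targeted column; here, just one.
        JET_SETCOLUMN setColumn = {};
        setColumn.columnid = idColumn;
        setColumn.pvData   = &idValue;
        setColumn.cbData   = sizeof(idValue);
        JetSetColumns(sesid, tableid, &setColumn, 1);

        // Write the record to the table.
        JetUpdate2(sesid, tableid, NULL, 0, NULL, 0);
    }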

Selection and Iteration

Reading, updating, and deletion are common operations that we sometimes apply to a specific row or set of rows that share some criteria.

The ESE APIs allow us to achieve this by first defining a range, and then iterating through it.

Setting Up a Cursor for a Range

Keys

To define a range, we first need to understand the concept of a key. We can think of a key as a fixed pointer to a specific record in a table. The pointer is set by specifying values for one or more columns, where those columns are part of an index on that table.

For example, if we have:

  • A table with columns: id, name, address, phone, age
  • An index X1 on (id).
  • An index X2 on (id, age).

Then, a key can be defined as:

  1. “pointer to the first row of row set with id = 5 in X1”. Or
  2. “pointer to the last row of row set with id = 7 in X1”

Partial Keys

As mentioned above, the pointer is set by specifying values for one or more columns - where those columns are part of an index on that table.

A key is considered partial if it does not specify values for all the columns of an index. If the key is partial, matching happens against the specified columns only.

In the above example, we can define a partial key as follows:

  • “pointer to the first row of row set with ‘id = 5’ in X2” (note that we have omitted age).  This key will locate entries with ‘id = 5’, and ignore the values of ‘age’. This capability is very useful when you do not want to match the whole key.

One important aspect here is that for an index (c1, c2, c3), we can have partial keys by omitting values for columns on the right only. 

In other words, the following are valid (partial) keys:

  • Key on c1 (partial key)
  • Key on c1,c2 (partial key)
  • Key on c1, c2, c3 (not a partial key, but still a valid key)

The following are not valid keys:

  • Key on c2, c3 (cannot omit c1)
  • Key on c3 (cannot omit c1 or c2 if c3 is to be used)
  • Key on c1, c3 (cannot omit c2 if c3 is to be used)

Ranges

Now that we can define keys, we just need two keys to define a range. In some cases, the criteria might be the same for both keys, except that we want one key to point to the first row of the set, while we want the other key to point to the last row of the same set (for example, “get me all students with age=15”).

ESE allows us to make this distinction when we are creating the key. We basically tell it whether we want the key to be a start key (JET_bitFullColumnStartLimit) or an end key (JET_bitFullColumnEndLimit).

Cursors

A cursor is a pointer to the row targeted by the next ESE row-specific operation. By setting the keys (above), the cursor is also set to the same row as the start key. The cursor can then be moved by calling JetMove.

ESE Range APIs

To define the ranges as described above, the user needs to do the following:

  1. Select the index the key will be built against by calling JetSetCurrentIndex4.
  2. Build the start key by repeatedly calling JetMakeKey for each column.
    1. The first call for a new key must specify the JET_bitNewKey flag. This is how ESE knows we are building a new key against the same index--for example, when we build both a start key and an end key.
    2. Note that this API does not take the column id as input, because it assumes the user is passing the values in the same order the columns are defined in the selected index. This guarantees that we build keys that omit only ‘right’ column values.
  3. Set the key to the first row by calling JetSeek.
  4. Build the end key by following the same steps as the start key. However, at the very end, set the range by calling JetSetIndexRange. (A sketch of all four steps follows.)
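Putting those four steps together, the code might look roughly like this (error handling omitted; the index name "X1" and the value 5 come from the earlier hypothetical example, and a Unicode build of the ESE headers is assumed):

    #include <esent.h>

    // Build a range over all rows whose 'id' column equals 5, using the
    // hypothetical index "X1" from the earlier example.
    void rangeOverId5(JET_SESID sesid, JET_TABLEID tableid)
    {
        long id = 5;

        // 1. Select the index.
        JetSetCurrentIndex4(sesid, tableid, L"X1", NULL, JET_bitMoveFirst, 1);

        // 2. Build the start key, and 3. seek to the first matching row.
        JetMakeKey(sesid, tableid, &id, sizeof(id),
                   JET_bitNewKey | JET_bitFullColumnStartLimit);
        JetSeek(sesid, tableid, JET_bitSeekGE);

        // 4. Build the end key, then set the (inclusive) upper limit.
        JetMakeKey(sesid, tableid, &id, sizeof(id),
                   JET_bitNewKey | JET_bitFullColumnEndLimit);
        JetSetIndexRange(sesid, tableid,
                         JET_bitRangeUpperLimit | JET_bitRangeInclusive);
    }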

Iterating

Iterating means moving the cursor (JetMove) from one record to the next in the returned result set. The cursor knows the start from the start key, and knows the end from the end key. Note that neither the cursor nor the result set is exposed directly to the user.

Row Specific Operations

Typically, we start by defining the range as described above, and iterate on the result using JetMove. The following are examples of row-specific operations.

Reading Records

We can retrieve the values of the current row by calling JetRetrieveColumn for each column we want to read.
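Combining the range, the cursor, and per-row retrieval, a minimal read loop might look like this (a sketch with error handling reduced to the loop condition, assuming a range like the one above has just been set, so the cursor already sits on the first row):

    #include <cstdio>
    #include <esent.h>

    void printIds(JET_SESID sesid, JET_TABLEID tableid, JET_COLUMNID idColumn)
    {
        JET_ERR err = JET_errSuccess;
        do
        {
            long id = 0;
            unsigned long cbActual = 0;
            JetRetrieveColumn(sesid, tableid, idColumn,
                              &id, sizeof(id), &cbActual, 0, NULL);
            printf("%ld\n", id);

            // Advance the cursor; JET_errNoCurrentRecord marks the end
            // of the range.
            err = JetMove(sesid, tableid, JET_MoveNext, 0);
        } while (err == JET_errSuccess);
    }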

Updating Records

We update values in the current row by calling JetPrepareUpdate / JetSetColumns / JetUpdate2.

Deleting Records

We delete the current record by calling JetDelete.

Threading and Transactions

ESE supports access to the same database from multiple threads. The recommendation on MSDN is to create a session for each thread. This becomes handy when submitting transactions from multiple threads, because the session id is what associates a commit transaction with its matching begin transaction. All calls in between are also associated through the same session id.

In the Austin project, since all our begin/commit transaction pairs happen in the same function, it is sufficient to make those functions non-re-entrant to avoid race conditions. This means that Austin's calls into ESE from different threads are serialized. This simplifies our threading model significantly--however, should we ever need to avoid serializing ESE access, we'll need to change this part of the implementation and create a session for each thread.
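For reference, the per-thread-session pattern that MSDN recommends looks roughly like this (a sketch; error handling is omitted, and the instance is assumed to be initialized elsewhere):

    #include <esent.h>

    void doWorkOnWorkerThread(JET_INSTANCE instance)
    {
        // One session per thread; the session id ties every call,
        // including begin/commit, to this thread's transaction.
        JET_SESID sesid = JET_sesidNil;
        JetBeginSession(instance, &sesid, NULL, NULL);

        JetBeginTransaction(sesid);
        // ... reads and writes through 'sesid' ...
        JetCommitTransaction(sesid, 0);

        JetEndSession(sesid, 0);
    }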

Abstracting ESE

Why abstract ESE?

As mentioned earlier in this blog post, the ESE interface is a flat C API. Here are some of the reasons that led us to create a higher-level abstraction on top of it:

  • Handle the memory allocation and de-allocation safely and avoid dealing with memory buffers at the application level.
  • Provide type/type size safety through templates and specialized templates.
  • Use some of the C++ standard structures and paradigms (vectors and iterators) instead of inflexible C data structures.
  • Reduce the surface of the storage layer to the application for better productivity.
  • Provide thread safety given that we are not using multiple sessions.

So, we ended up creating an engine class which wraps the functionality of the ESE APIs. We also created a set of classes that represent tables, columns, indexes, and cells. Along with satisfying the requirements outlined above, the engine is capable of interacting with those data structures to perform various operations (like creating tables, inserting/updating data, finding data, data deletion, etc).

Tables

Creating tables follows the same logical sequence ESE defines.

  • Create a table description object.
  • Create column description objects – add them to an array – add the array to the table object.
  • Create index description objects – add them to an array – add the array to the table object.
  • Tell ESE to create a table using the table description object.

Table Description Object

std::wstring name = L"Photo";
std::shared_ptr<s::itable> table = s::createTable(name);

Column Description Object

std::shared_ptr<s::icolumn_typed<b::uint32>> photoIdColumn;
photoIdColumn = s::createColumn<b::uint32>(L"ID", s::column_type::type_int32);
table->addColumn(photoIdColumn);

Note: There are definitely lots of improvements to be made. For example, we should not be passing column_type::type_int32 to the createColumn function since the template parameter already has this information.

Index Description Object

std::shared_ptr<s::iindex_column> activeIndexColumn;
activeIndexColumn = s::createIndexColumnDefinition(_activeColumn, true);

std::shared_ptr<s::iindex> activeIndex;
activeIndex = s::createIndex(L"ActivePhoto_Index", false, false);

activeIndex->addIndexColumn(activeIndexColumn);
table->addIndex(activeIndex);

The Cell Object

While table, column, and index objects hold definitions – the cell object is an instance of a column value in the database.

The only way to create a cell is through a column. A cell holds the same type as that of the column which created it.

std::shared_ptr<s::icell_typed<b::uint32>> photoIdCell;
photoIdCell = photoIdColumn->createCell(0); // initialize with 0

Whenever there is an operation that requires passing values to the engine, we use a set of pre-created cells and populate them with the values. Then, this set is passed to the engine which knows how to traverse them, identify the parent columns, and translate the data structure to ESE C-structures before making the API calls.

void photo_table::insert(b::uint32 photoId, b::uint32 pageId, bool active)
{
    _photoIdCell->setValue(photoId);
    _pageIdCell->setValue(pageId);
    _activeCell->setValue(active);

    // The vector must be created before cells can be pushed into it.
    auto row = std::make_shared<std::vector<std::shared_ptr<s::icell>>>();
    row->push_back(_photoIdCell);
    row->push_back(_pageIdCell);
    row->push_back(_activeCell);

    table->insert(row);
}

Running Queries

Building a Query

To build a query, we need to specify which fields we are interested in (SELECT c1, c2), and then specify which criteria (set of values) need to be satisfied in the returned rows.

// select columns we are interested in...
std::shared_ptr<s::iresult> result = s::createResult();

result->addColumn(_photoIdColumn);
result->addColumn(_pageIdColumn);
result->addColumn(_activeColumn);

// set staging row entries as search criteria and execute...
_photoIdCell->setValue(photoId);
_pageIdCell->setValue(pageId);

// The vector must be created before cells can be pushed into it.
auto row = std::make_shared<std::vector<std::shared_ptr<s::icell>>>();

row->push_back(_pageIdCell);
row->push_back(_photoIdCell); // note: omitting a field here would have
                              // resulted in a partial key.

// execute
table->read(row, result); // where values = those in row

Reading the Result

The result is just a two-dimensional array. One dimension is a vector of rows (iresult_row), and the second dimension is a vector of cells (icell). To read the content, we iterate through each row, and through each cell. For each cell, we identify which column it belongs to and cast it to extract the value.

// iterate through the result...
std::shared_ptr<std::vector<std::shared_ptr<s::iresult_row>>> rows = result->getRows();

for (auto &row : *rows)
{
    std::shared_ptr<std::vector<std::shared_ptr<s::icell>>> cells = row->getCells();

    for (auto &cell : *cells)
    {
        if (cell->getColumnId() == _photoIdColumn->getId())
        {
            std::shared_ptr<s::icell_typed<b::uint32>> typedCell =
                std::dynamic_pointer_cast<s::icell_typed<b::uint32>>(cell);
            b::uint32 readPhotoId = typedCell->getValue();
        }
        // ... handle the remaining columns the same way ...
    }
}

Source Code

You can see how we implemented this functionality by looking at the source code files. The source code is available for download on our CodePlex page.

The following folders have the files relevant for this post:

  • baja\storage\db
  • journal\models\db

We have an improved version of the storage wrapper along with a minimal client application to demonstrate the functionality. We have uploaded the sources to prototypes/storage. It is much easier to understand how the storage works by looking at those files.

BUILD: Wednesday Update


C++ is alive at Build 2012!

Developers have been filling the C++ sessions and getting tips, tricks, insights and example code from the experts. But you don't have to take our word for it -- sessions from Tuesday are available online on Channel 9.

And if you have 90 seconds, catch the quick chat with Herb Sutter, also on Channel 9.

Visit C++ at Build 2012 for the larger list of sessions/summaries.
