More on logical models differing from physical ones
On the topic of logical models and physical models different, I found the following short anecdote in Paul Oldfield’s Domain Modeling whitepaper:
Anecdote: One project for which I worked, and against my advice, chose to design an OO system based on a data model. They made no attempt to get the responsibilities in the right place, despite an explicit requirement from the customer that the system be flexible in response to changing requirements. After major redesigns in response to changing requirements, the plug was pulled on the project before it delivered any code at all to the customer. It had become clear that the design process being employed could not keep up with the rate of change of requirements, and never would.
I’m in the process of designing a new application. I can’t say what it is yet, but I can say that there is a difference between the physical and logical models. Roughly, the problem corresponds to the image below.

First, lots of external data is collected from some source. The data goes into the unshaded white tables.
Then, through some process, we translate data from those tables into another set of tables. We do this in the database both so that we can have before and after copies for sanity checking purposes, but also so that the rules and constraints in the database can act as checks on the transformation itself.
Only the shaded tables in the physical model end up being relevant for display to users on the front-end. However, those tables are highly normalized and may have a variety of peculiarities about them (from a domain modeler’s perspective); those peculiarities come from the necessary imposition of the the database constraints and heavy normalization we needed to guarantee the data’s integrity.
So the third step is to translate these physical tables into classes that more logically represent our problem. The logical model hides all of decisions of the data layer for the programmer who is creating the web front-end. The input tables disappear, and the other tables are recombined in ways that make sense for the application. Some normalization can be lost because it’s not meaningful here.
Certainly, I could create a front-end that functions perfectly well without this layer. But it will be much easier to create and modify the front-end in the future with this layer in place. Although it seems like “work” to create the new abstraction layer, it vastly simplifies the problem of creating the front-end once the translation layer exists. The act of “translation” happens in one place, not in every controller or view that needs to work directly with data from the database.
In fact, as I’ve been arguing already (though not on this blog), you are doing this step all the time already, because no matter how you think of it, you do have to somehow translate data from the data layer into the format it ultimately gets displayed as on the front-end. The difference is that this model recognizes this process and brings all of the code that accomplishes it together in one place. That makes adjusting to changing requirements, or keeping database changes hidden from consumers a much simpler task.
The rub is that one needs to learn how to design the “domain inspired logical model.” Right now that seems to be the biggest gap. ORMs like ActiveRecord trick you into thinking that the data model and the object-oriented logical model are the same, and that’s something we need to overcome.
Physical vs Logical Modeling in Rails
I gave a talk on Tuesday at the Boston Ruby User’s Group meeting. The talk was quite exciting, with about 60 people split rather evenly between skeptics and converts to my message. Although they are fairly sparse for consumption without having first heard the talk, the slides are attached. As I understand more and more of the initial objections folks have to splitting up domain and data modeling this way, I’ll be working toward a more comprehensive argument and set of examples. Yes, ideally for the next edition of my book. Here’s the
PDF.
Enterprise Rails available for Pre-order on Amazon
As the title suggests, you can now pre-order my book on Amazon.
Materialized Views in Postgres
I got a lot of great feedback on my tutorial at PGCon. The slides, as well as a sample chapter from my upcoming O’Reilly Book are posted on the PGCon web site. You can also grab them here:
- Slides from PGCon 2008: Materialized Views That Work.
- Sample Chapter: Materialized Views from Enterprise Rails (O’Reilly Media).
I highly recommend the sample chapter, as it takes you through an example in prose form. The slides from the talk are pretty content-heavy but some code slides may be difficult to grasp without commentary.
hash_extension updated for Rails 2.0.x
Thanks to Wayne E. Seguin for pointing out that extract_options_from_args is gone in Rails 2.0.x. hash_extension 0.0.3 now works in 1.x and 2.0.x. To upgrade:
sudo gem update hash_extension
To install for the first time:
sudo gem install hash_extension
For more info, check out the docs.
PGCon 2008
My presentation on materialized views in Postgres, and especially materialized views with time dependencies, has been accepted to PGCon 2008.
Abstract: Complex queries repeated in application code often are recast as views. This makes programming easier, but it does nothing to improve the performance of the underlying query. Materializing the view is the next logical step. Materialized views are stored in standard tables, so joins, procedure evaluation, and conditions from the original query are pre-computed before queries are run against the materialized view. Because the data is in a table, performance can be further improved by adding indexes.
Many developers shy away from materializing views in Postgres because the functionality is not built in, as it is in Oracle. In this talk, I will present a straightforward approach that can be applied to materialize any view, even those that Oracle cannot handle. Case in point: views that have a time-dependency (such as those based on the current time, now()) don’t have convenient trigger points (such as commits) to signal that the view should be refreshed, and are a major stumbling block for view materialization. This talk presents a technique to handles those cases as well, bringing materialized views even more powerful than those available in commercial products to Postgres.
Hope to see you in Ottawa!
Speeding up ActiveRecord with Hashes, Take 2
A few weeks ago, I posted about the release of my gem, hash_extension, which makes Ruby hashes act a little more like plain old objects. That’s a good thing, because ActiveRecord requests that return hashes instead of full-blown ActiveRecord objects are about 50% faster. 50% is not a performance tweak — it’s basically the best thing that’s ever happened to ActiveRecord.
In version 0.0.1, the only way to get back hashes from ActiveRecord for use with hash_extension was to use select_all:
Foo.connection.select_all("select * from foo")
In the docs for hash_extension I put out a call for someone to extend ActiveRecord itself to return hashes from more natural, ActiveRecord-esque methods. Elliot Laster answered that call, and now, in version 0.0.2, find_as_hashes and find_by_sql_as_hashes are now available:
Foo.find_as_hashes(:all)
Foo.find_as_hashes(:first)
Foo.find_as_hashes(:all, :conditions => "bar = 'baz'")
Foo.find_by_sql_as_hashes("complex sql goes here")
To learn more about the gem and to download, go here.
ActiveRecord is slow. Hashes are fast.
Introducing the hash_extension gem…
…the first gem associated with my upcoming book, Scaling to Enterprise with Ruby on Rails.
Are you tired of hearing that Ruby is slow? Well, Ruby is slow, in many ways. The trick to a fast site is to not use the parts of Ruby and Rails that are slow in places where performance counts. For example, loading ActiveRecord objects happens to be extremely slow. Something simple like the following statement may take very little database time, but then will spin through the slow process of ActiveRecord object creation.
MyObject.find(:all)
On my dual core macbook pro, on a table with 40k records, this takes 7 seconds of Ruby time. Conversely, the following query, which returns not an array of ActiveRecord (MyObject) objects, but an array of hashes with all the same properties, takes just over 3 seconds:
MyObject.connection.select_all("select * from my_objects")
So if you don’t need the associations or methods that come with the full ActiveRecord version of your data, you can save a lot of Ruby cycles by using hashes instead — over 50% of the overhead. The problem is that the two statements above are not drop-in replacements for each other. Objects follow dot notation (f.attr) whereas hashes follow, well, hash notation (f['attr']). So to switch to the hash result, you would have to update all your code to follow hash notation instead of dot notation, and that’s a pain (not to mention ugly).
hash_extension to the rescue! This gem allows you to access hashes just as you would regular objects. With this gem, the following is possible:
>> hash = Hash.new >> hash.foo = 'bar' >> hash.foo => 'bar'
Now the two statements above are interchangeable. If you have slow pages in need of tuning, and you’re loading lots of object for display purposes only (e.g., you don’t actually need the weight associated with the full objects), this is an easy way to eek out some more performance.You can download the gem here and read more about how to set it up and use it here.
Geographic Distance in Postgres
To help proliferate the idea that the database is very much a part of your Rails app, I’m going to be posting useful tidbits of PL/pgSQL that can help you application go and go fast. Feel free to send in your own PL/pgSQL snippets and I’ll post them here as well.
I find that when folks say Ruby/Rails is slow, they’re doing things in Ruby/Rails application code that they shouldn’t be. The following PL/pgSQL function will compute an approximate distance between two sets of latitude/longitude pairs.
create or replace function miles_between_lat_long( lat1 numeric, long1 numeric, lat2 numeric, long2 numeric ) returns numeric language 'plpgsql' as $$ declare x numeric = 69.1 * (lat2 - lat1); y numeric = 69.1 * (long2 - long1) * cos(lat1/57.3); begin return sqrt(x * x + y * y); end $$;
Why is this useful? Well, if your web application needs to find distances, such as “drug stores within 10 miles of zip code 02139,” then you can construct a single SQL query that will give you the answer. This can be far more efficient than loading the entire zip code database in memory and trudging through the data in the application layer, which may not be optimized for doing these sorts of computations. For example, assume you’ve got a table drug_stores with latitude and longitude columns, and a table zip_codes that has the latitude and longitude for each zip code in the United States. To find the drug stores within 10 miles of 02139, you issue the following query:
select *
from drug_stores d,
zip_codes z
where z.zip_code = '02139'
and miles_between_lat_long(d.latitude, d.longitude, z.latitude, z.longitude) < 10;
