How and when to use Group by Having and Distinct

an article added by: Daniel Carlson at 03252008


Programming :: How and when to use Group by Having and Distinct ::

 French | Spanish | Portuguese | Italian | German | Japanese | Chinese | Korean | Russian | Arabic Bookmark and Share

After learning how SQL works internally, you can better learn how to put the more complicated aspects of it to use. First we need to clarify the difference between GROUP BY and DISTINCT. Quite a bit of confusion exists over these two clauses, because they often produce the same query results but for very different reasons. In fact, GROUP BY and DISTINCT are not related in any way whatsoever. The GROUP BY clause produces as its intermediate work product a table of rows representing each group, and a sort of “invisible third dimension” lies behind each row, containing the constituent rows that make up that group. The constituent rows “sitting behind” each group contain the exact same values in each of the columns specified in the GROUP BY clause; all columns not specified in the GROUP BY clause can have different values in these constituent rows. By comparison, DISTINCT looks at each row produced by the SELECT clause and simply discards all duplicate rows so that each row in the final query result is unique (or distinct). DISTINCT looks at the combination of all columns in the SELECT clause to determine whether one row is identical to another. If the GROUP BY clause of a statement contains all the same columns contained in the SELECT clause, the final query result is the same as a similar statement that doesn’t contain a GROUP BY clause, but instead contains a DISTINCT clause. Because every column in the SELECT clause is also specified in the GROUP BY clause, the group is defined as any distinct combination of those columns, so the final query result contains all such unique combinations because of this grouping mechanism. On the other hand, DISTINCT simply discards all but the first occurrence of each unique combination of the columns in the SELECT clause. The two results are the same but for very different reasons. If you are not fully aware of these reasons, you ultimately write incorrect SQL in your application. Read the following parts to gain more insight into how the GROUP BY clause works with HAVING and aggregate functions— something DISTINCT cannot do. As shown in the code, you can use the HAVING clause to find out how many Employees have two or more dependants. But how does the SQL statement determine this fact? The HAVING clause criterion in the listing is simply as follows:

HAVING Count(*) >= 2

Who said anything about Dependants here? No one; it is implied. You see, whenever you create a group, that “invisible third dimension” of constituent rows for that group contains one row for each element in that group. Because you group by the columns that you SELECT from the Company and Employee tables, one constituent row for each Dependant is joined in the FROM clause. So the HAVING clause performs its calculations and functions on the “invisible third dimension” of constituent rows “sitting behind” each group. Take, therefore, the following line of code:

HAVING Count(*) >= 2

Using that code in a listing is like saying the following: For each group HAVING at least two constituent rows sitting behind it. Another way to visualize the HAVING clause is to think of it as a WHERE clause that operates on those constituent rows sitting behind each group.

SELECT
   e.LastName + ‘, ‘ + e.FirstName AS EmployeeName,
   count(*) as NumDependantsAdded13OrOlder
   FROM
   Employee e INNER JOIN Dependant d
   ON e.SSN = d.SSN
   GROUP BY
   e.LastName,
   e.FirstName
   ORDER BY
 EmployeeName ASC

Now you can add a HAVING clause as shown in the code so that you retrieve only Employees whose youngest Dependant was added to the database.

SELECT
   e.LastName + ‘, ‘ + e.FirstName AS EmployeeName,
   count(*) as NumDependantsAdded13OrOlder
   FROM
 Employee e INNER JOIN Dependant d
ON e.SSN = d.SSN
   GROUP BY
   e.LastName,
   e.FirstName
   HAVING
   Min(d.Age) > 13
   ORDER BY
 EmployeeName ASC

To aggregate means to gather multiple elements into a whole. An aggregate function operates on a set of rows and returns a single value representing the aggregate of those rows. For instance, Avg(Salary) returns the average salary of the selected employee rows, so Avg( ) is considered an aggregate function. Aggregate functions are also known as set functions because they operate over a set of data. The best way to determine when you can and cannot use aggregate functions is to carefully listen to the English language version of the question (that is, the query) that you are asking the database. “What is the average salary of all Employees born after July 4, 1975?” is, for example, a straightforward, well-formed question, so it easily fits into an SQL statement, as the following code shows:

SELECT
   Avg(Salary)
   FROM
   Employee
   WHERE
   DateOfBirth > ‘07/04/1975’

But what about “What is the name of every Employee and the average salary of all Employees born after July 4, 1975?” Is this a valid question? Actually, it is two separate questions that are unnaturally “smushed” together. The following query that would attempt to answer this question is equally unnatural and would, therefore, throw an error:

SELECT
   LastName + ‘, ‘ + FirstName AS EmployeeName,
   Avg(Salary)
   FROM
   Employee
   WHERE
 DateOfBirth > ‘07/04/1975’

How can you retrieve both individual employees and the average of all employees as a single query result set? You can’t! These are two different sets of data with two totally different dimensions, and one has absolutely nothing at all to do with the other. The fact that both questions are about Employees makes no difference. As do all functions, the Avg() function returns a single value—in this case, the average of the values in the Salary column for the set of data described by the WHERE clause. That’s one value, meaning that it appears exactly one time in the final query result. In other words, it has a dimension of one row by one column. On the other hand, consider the column defined by the following code:

LastName + ‘, ‘ + FirstName AS EmployeeName

EmployeeName is not a function, nor does it operate over a set of data. As you would expect, this query column definition returns one value for each row in the data described by the WHERE clause. In other words, it has a dimension of N rows by one column. Clearly, the averaging function cannot co-exist with the column display in the same SQL statement because they are of different dimensions. These same rules of operation hold for all aggregate functions, such as Min(), Max(), Sum(), and so on. If GROUP BY is present in an SQL statement, the set of data over which a set function operates changes from the one entire data set described by the statement to each separate set of constituent rows sitting behind each group. Do you see why that visualization is so useful? By showing that invisible third dimension, you can now visualize the result of the set function for each set or group of rows in the result set. If you want to find the average age at which the Dependants of each Employee were recorded in the database, for example, you would execute the following query:

SELECT
   e.LastName + ‘, ‘ + e.FirstName AS EmployeeName,
   Avg(d.Age) as AverageAge
   FROM
   Employee e INNER JOIN Dependant d
   ON e.SSN = d.SSN
   GROUP BY
   e.LastName,
   e.FirstName

And the result would be as follows:

EmployeeName AverageAge
   Churvis, Adam 55
   Davidow, Allen 30
   Silverberg, Bethany 15
   Sanders, Billy 10
Thor, Dirk NULL
   Rodriguez, Kiki 17
   Lester, Lance 1
   Romanova, Natasha 38
   Kayashunu, Oksal 10
   Kokilas, Susan 2
   Williger, Timmy 38
   Haymen, Turk 16
 Feuilliette, Valerie 59

This time, the GROUP BY clause tells the Avg() function to operate over each employee’s “group” of Dependants rather over than the entire data set all at once. But why can you display the function result and the query columns together in the same query? Because the dimension of the Avg() function’s results is exactly the same as that of the other query columns in the SELECT clause. The Avg() function returns a single value for every group, as do the column definitions in the SELECT clause. We hope that these visualizations of how SQL works internally helps you write SQL statements that don’t throw errors because you’re trying to do the impossible but that give you the results that you’re looking for. The difference may seem casual, but it really points to a new way of thinking about your code, as follows: Errors are good. What?! The infidel speaks blasphemy! No, not really. Let us explain. So far in this article, you’ve learned how to design databases that plug up all the little holes that you may have had in your previous database designs so that only purified data goes into your tables and referential integrity is indeed preserved under all possible circumstances. This extra “tightness” inherently causes many more errors to be thrown by your database whenever incorrect operations are attempted. If you’ve followed along so far, you now know, for example, that you should define alternate keys on unique columns that are not primary keys, such as a user’s e-mail address. Once you have an alternate key defined on the e-mail address column, you can safely try to insert the user no matter what e-mail address that he gives. If a duplicate e-mail address is already in the database, an exception is thrown. But this exception is no longer considered an error, because you are going to programmatically utilize it in your ColdFusion code to inform the user of the specific problem that he is facing and perhaps also route him to the next logical step in resolving his problem. Now that’s good programming! Previously, if someone attempted to add more than one user with the same e-mail address, you probably had some ColdFusion code attempt to find any users with the same e-mail address. And if RecordCount was greater than zero, you told the user “No go.” The code shows an example of structured exception handling in action. It’s going to catch exceptions where the database throws an error because of an alternate key violation and then handle those exceptions by sending the user to an alternative registration page that informs him of the problem and enables him either to log in by using his existing e-mail address and password or to e-mail his forgotten password to himself.

legal disclaimer

Our website is not responsible for the information contained by this article. Web-articles is a free articles resource.
Suggestion: If you need fresh, daily updated content for your website, feel free to use our service. Click here for more information.

related articles

1. Modifying a Company in the Database
The data that CFQUERY requests from the database comes back in a result set, and the Name attribute tells ColdFusion what that result set is to be named. The SQL statement consists of three clauses: SELECT, FROM, and ORDER BY. SELECT tells the database which columns to retrieve from the database; FROM tells the database which table to retrieve those columns from; and ORDER BY tells the database how to sort the results. The result set returned from CFQUERY contains multiple rows of data, and each row has multiple columns. It would ...

2. Building the company edit action template
The code in CompanyEditAction.cfm is like the code in CompanyAddAction.cfm, but the edit action updates rather than inserts. Create a file named CompanyEditAction.cfm inside the Ch02 directory, type the code into the file’s editing window, and save the file. <cfquery name=”UpdateCompany” datasource=”#Request.MainDSN#”> UPDATE Company SET CompanyName = ‘#Trim(Form.CompanyName)#’, Address = ‘#Trim(Form.Address)#’, City = ‘#Trim(Form.City)...

3. Adding a New Employee to the Database
The CFQUERY in CompanyDeleteAction.cfm uses a DELETE statement with two SQL clauses: DELETE and WHERE. DELETE tells the database the table from which to delete a record, and WHERE tells the database which record to delete. You can watch the company delete process in action. Point your Web browser to http://<yourserver>/CFMXBible/Ch02/CompanyGetDeleteForm.cfm and enter a CompanyID. (To get a valid ID, go to the company list and pick a number from the ID column.) Click Submit to see the chosen company’s informatio...

4. Modifying an Employee in the Database
The DateFormat() function around the DateOfBirth column in the codereturns the employee’s birth date reformatted according to a display mask. DateOfBirth normally comes back from the database in the following format: 2002-01-01 00:00:00.0 That format is not very user-friendly. Calling DateFormat() with a mask of “mm/dd/yyyy” returns the date as follows: 01/01/2002 This version is, of course, more natural and easy to read. The same is true for the employee edit process...

5. Removing an Employee From the Database
The user must have the capability to remove employees from the database. The employee delete process is a simple combination of techniques that you have already learned, such as retrieving a record from the database, displaying that record in a template, and so on. The first page in this process is nearly identical to the employee get edit form. Create a file named EmployeeGetDeleteForm.cfm inside the Ch02 directory, type the code into the file’s editing window, and save the file. <html> <head> <ti...

6. Making direct links to the forms
Say that you want to modify or delete a company. Right now, you need to remember the company’s ID, go back to the launch pad, click Company Edit, and enter the company ID, all just to get to the edit form. Wouldn’t you rather click a company in the list and go directly to the edit form? ... <table> <tr> <td><b>ID</b></td> <td><b>Name</b></td> <td><b>Address</b></td> <td><b>City</b></t...

7. Using a Custom Tag
Custom tags are reusable, developer-authored extensions to the ColdFusion language. The custom tag that you create in the following parts displays today’s date in a familiar format. Create a file named TodaysDate.cfm inside the Ch02 directory, type the code into the file’s editing window, and save the file. <cfoutput>#DateFormat(Now(), “ddd, mmm d, yyyy”)#</cfoutput> Now() returns the current date and time, and DateFormat() reformats the date. To call your custom tag, open index.cfm, a...