Implementing Web Services for Large or Slow Data Sets

an article added by: Sonja Lande at 06012007



In the context of an Ajax and REST application, Web services that expose large or slow data sets deserve special attention, because the resulting solution must be as efficient as possible. This article covers the following aspects of implementing a Web service that exposes large or slow data sets:

• Understanding the context of what a large or slow data set application is

• Outlining the overall architecture of the solution

• Determining how an application should be architected in coding terms

Problem

You want to create Web services that expose large data sets or data sets that take a long time to generate.

Theory

Many developers experience the need to show a huge number of records to end users. The first reaction of most developers to this issue is, “No, it can’t be done.” Yet when you look at the Google and Yahoo! search engines, you see that it can be done. This article sets out to solve such a problem, specifically how to display 64,000 records in a Web browser.

Please note that the solution for a large data set or slow data set Web service is very specialized and should not be used as a general solution. The added complexity makes it impractical for use in every application. Web services that consist of a single request and response are simple and require no management of state or callbacks. In this article’s solution, state and callbacks are required. Efficiency is one of the requirements of this solution, but remember that efficiency is relative, and the solution will be as efficient as possible for the context.

As you know, in any search engine, you enter a term or phrase in a text box, click the Search button, and the relevant HTML pages are returned for the term or phrase you typed in. Whether the search engine presents useful results is not the point of this solution. What is relevant is the fact that a search was executed that resulted in an HTML page displaying 10 results per page of the roughly 175,000,000 available results. The HTML search result page looks good, and the search took only 0.13 seconds. The search speed should amaze users, but I tend to be cynical. I am sure that the 0.13 seconds is not a lie, but the question is, what does the 0.13 seconds measure? Is the 0.13 seconds a measure of having found the search terms in 175,000,000 pages? I doubt it, because if this were the case, it would mean each page was found in 0.00000000074286 seconds, or each page was found in two clock cycles of a 3GHz CPU. These statistics should make anybody wary of the results found, even if parallel processes were involved. So if the statistics are very approximate to the point of being irrelevant, what is actually going on? The search engine is solving the problem of large or slow data sets using an illusion. The illusion is that the search engine is presenting you with the information in a fast manner, even though you are seeing only an extremely small sliver of the total information.
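The arithmetic behind that skepticism can be checked in a few lines. This is only a sketch that reuses the figures quoted above (0.13 seconds, 175,000,000 results, a 3GHz CPU); the variable names are made up for the illustration.

```python
# Figures quoted in the search example above.
search_seconds = 0.13
result_count = 175_000_000
cpu_hz = 3_000_000_000  # 3GHz: three billion clock cycles per second

# Time available per result if every result had really been examined.
seconds_per_result = search_seconds / result_count

# The same budget expressed in clock cycles of the 3GHz CPU.
cycles_per_result = seconds_per_result * cpu_hz

print(f"{seconds_per_result:.5e} seconds per result")
print(f"{cycles_per_result:.2f} clock cycles per result")
```

The budget works out to a little over two clock cycles per result, which is why the 0.13 seconds cannot plausibly measure an examination of every page.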

It is not difficult for a search engine to search its indices and return 10 results of a huge data set. As the 10 results are generated and returned as a single HTML page, the second and third batches of results are being generated. I would even hazard a guess that Yahoo! generates a result list of 100 found links. I guess that it is probably a result set of 100 elements because at the bottom of the page there are 10 page links of search results. Multiply 10 and 10, and you get 100 links. What is very interesting is how Yahoo! allows you to retrieve the result set. Consider the URL generated by the query:

   http://search.yahoo.com/search?p=really+big+search&fr=FP-tab-web-t500&toggle=1&cop=&ei=UTF-8

Based on your experience with the REST URLs discussed in the previous articles, you should be able to guess what the individual query parameters do. Before I explain what I think the query parameters do, let’s look at the URL that appears after clicking the second page link (indicated by “2” at the bottom of the results):

   http://search.yahoo.com/search?p=really+big+search&toggle=1&ei=UTF-8&fr=FP-tab-web-t500&b=11

From this URL, it seems that /search is a root URL from a REST perspective. So does this mean that executing the URL /search would return all search results? In theory, yes, but I doubt it actually would, because this would mean returning an alphabetical listing of all HTML pages indexed in Yahoo!, which is completely impractical from an implementation perspective. I think Yahoo! is using a RESTful approach, because if you click the Images link at the top of the results page, the following URL is generated:

   http://images.search.yahoo.com/search/images?p=really+big+search&toggle=1&ei=UTF-8&fr=FP-tab-web-t500&fr2=tab-web

Notice that the Images URL is /search/images, meaning that images belong to the search URL, but they represent a more specific type of search. In a REST Web service approach, this represents the ideal naming strategy. Also notice that the general and image-based searches share the same query parameters (p, toggle, ei, and fr). This again illustrates a well-engineered REST Web service.

The parameters of the URL represent a way of filtering the huge amount of search results. The first filter is the query itself, which is defined by the p query parameter. This is a very clever way of defining a result set, because it means that multiple people searching for the same terms will see the same view. From a search perspective, Yahoo! does not distinguish between users.
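To see what actually differs between the two search URLs quoted earlier, the query strings can be taken apart with Python's standard urllib.parse module. This is a small sketch for inspection only; the helper name query_params is made up here, and nothing in it reflects how Yahoo! processes the parameters.

```python
from urllib.parse import urlsplit, parse_qs

# The two search URLs quoted in the article (first page, then page "2").
first_page = ("http://search.yahoo.com/search?p=really+big+search"
              "&fr=FP-tab-web-t500&toggle=1&cop=&ei=UTF-8")
second_page = ("http://search.yahoo.com/search?p=really+big+search"
               "&toggle=1&ei=UTF-8&fr=FP-tab-web-t500&b=11")


def query_params(url):
    """Return the query string as a dict of single values, keeping blank ones."""
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}


first = query_params(first_page)
second = query_params(second_page)

# Both pages use the same resource path and the same decoded query term.
print(urlsplit(first_page).path, "|", first["p"])

# Parameters that are present in only one URL or that changed value.
changed = sorted(name for name in first.keys() | second.keys()
                 if first.get(name) != second.get(name))
print(changed)
```

The path and the p parameter are the same in both requests; the only differences are the empty cop parameter, which disappears, and the new b parameter.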

Since multiple people can see the same results, there needs to be a way of identifying which results are to be returned. The b query parameter serves that purpose: it defines the start index of the search results. If you doubted that Yahoo! keeps a server-side result set, the b query parameter should convince you. It is a numerical value that in effect says, “Please return the links available at the list indices 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 of the search result set really big search.”

The Yahoo! example illustrates that a key component of presenting large data sets is creating an illusion. The illusion in this case is presenting a subset of the results, with the remainder of the results presented when they are requested. So if a SQL query results in 64,000 records, the illusion is to present the top 100 or so results, while in the background preparing another batch of 100 or so results.

The following list presents the attributes of a situation in which you would apply the techniques defined in this article:

• The data set could be construed as infinite. Even though there are theoretical limits, from a practical perspective the data set seems infinite. For example, any mathematical algorithm that generates a series of data, such as all squares of the numbers 1 to 1 million, could be construed as infinite. Another example is a search engine linked to an enormous database that generates a huge result set when the database is queried.

• The data set is not available at the time of the request. In this context, you make a request that triggers a sequence of events. Because the events require some computing time, the results are not available immediately. For example, in the case of a search engine mashup, delivery of the results from the individual search engines requires a small amount of time. Another example is the profit and loss calculation for an investment portfolio.

• The data is a single block of many elements, and the block must be considered as a single contiguous piece of data.

Solution

The architecture for large or slow data sets requires a server component that supports multiple threads or multiple processes, and a client component that includes a two-channel communication mechanism. The architecture is illustrated using an example historical ticker application. The client has two interactions: HTTP POST and HTTP GET to the same URL, /services/historical/*. At a technical level, the two interactions are separate, but they often work together. The POST interaction is used to send data, and the GET interaction is used to retrieve data. Technically, a REST Web service implies such an interaction: to start a task, a POST request is sent, and the answers for the task are retrieved using a GET.
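The POST and GET pairing on one URL can be modeled without any Web framework. The sketch below is an assumption-laden illustration, not the article's implementation: the handle function, the in-memory tasks dictionary, the XYZ ticker symbol, the sample record, and the choice of status codes (202 for an accepted task, 404 when no task exists) are all hypothetical.

```python
from http import HTTPStatus

PREFIX = "/services/historical/"


def handle(method, path, tasks):
    """Model the two interactions on one URL; returns (status, body).

    `tasks` is a hypothetical in-memory store: URL path -> results so far.
    """
    if not path.startswith(PREFIX):
        return HTTPStatus.NOT_FOUND, None
    if method == "POST":
        # Start the task; its results arrive later and are not in the response.
        tasks[path] = []
        return HTTPStatus.ACCEPTED, None
    if method == "GET":
        if path not in tasks:
            return HTTPStatus.NOT_FOUND, None
        # Return whatever has been generated so far, possibly nothing yet.
        return HTTPStatus.OK, list(tasks[path])
    return HTTPStatus.METHOD_NOT_ALLOWED, None


tasks = {}
url = PREFIX + "XYZ"                     # hypothetical ticker symbol

print(handle("GET", url, tasks))         # no task has been started yet
print(handle("POST", url, tasks))        # start the task
tasks[url].append({"day": 1, "close": 100.0})   # placeholder result, not real data
print(handle("GET", url, tasks))         # partial results are now available
```

The point of the sketch is only that the POST never carries the answer: it starts work, and every answer travels over the GET channel.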

Two channels are used so that the client can receive multiple answers to an executing task. This ability is a necessity when you are working with large or slow data sets. You cannot execute a task and wait for the complete answer, even though using XMLHttpRequest in asynchronous mode would not stop the browser from functioning. You want to retrieve data on a piecemeal basis because you want to display the results as soon as they are available. On the server side, two components implement the interactions: TaskManager and ResultCache. TaskManager responds to all POST requests and executes the appropriate task. The individual tasks then generate their results and add them to ResultCache. When the client executes a GET, the data from ResultCache is retrieved. TaskManager is only responsible for executing the tasks and managing the reference to ResultCache. TaskManager is not responsible for knowing the details of the results generated or the nature of the task that is executed.
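A minimal sketch of how TaskManager and ResultCache might divide the work is shown here, using Python threads. Only the two component names come from the article; the method names, the slow_squares task, and the polling interval are assumptions made for the illustration.

```python
import threading
import time


class ResultCache:
    """Thread-safe store of partial results, keyed by task identifier."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}   # task_id -> list of results generated so far
        self._done = set()   # task_ids whose task has finished

    def add(self, task_id, item):
        with self._lock:
            self._results.setdefault(task_id, []).append(item)

    def mark_done(self, task_id):
        with self._lock:
            self._done.add(task_id)

    def read_from(self, task_id, start):
        """Return (items at index >= start, finished flag) in one atomic read."""
        with self._lock:
            items = self._results.get(task_id, [])[start:]
            return items, task_id in self._done


class TaskManager:
    """Runs tasks in background threads; knows nothing about their results."""

    def __init__(self, cache):
        self._cache = cache

    def start(self, task_id, task):
        def run():
            try:
                # The task only sees an emit callback, never the cache itself.
                task(lambda item: self._cache.add(task_id, item))
            finally:
                self._cache.mark_done(task_id)

        thread = threading.Thread(target=run, daemon=True)
        thread.start()
        return thread


def slow_squares(emit):
    """Hypothetical slow task: emits one result every few milliseconds."""
    for n in range(1, 6):
        time.sleep(0.01)
        emit(n * n)


cache = ResultCache()
manager = TaskManager(cache)
manager.start("squares", slow_squares)    # plays the role of the POST

received = []
finished = False
while not finished:                        # plays the role of the periodic GET
    batch, finished = cache.read_from("squares", len(received))
    received.extend(batch)                 # display results as they arrive
    time.sleep(0.005)
print(received)
```

The client keeps its own position (the number of results already received) and asks only for what is new, so each poll stays small no matter how large the complete result set becomes.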

Managing the Client Request/Response Cycle

For the moment, let’s ignore the URLs and the server side, and focus on the client side. We can define the most amazing URLs and the most interesting server side, but if the client side functions inefficiently, everything else does not matter. Defining what the client can and cannot do goes a long way toward defining the server URLs and server-side code. First, let’s look at a mapping example along the lines of http://map.search.ch or http://maps.google.com. The following steps outline the logic that keeps the scrolling of the map smooth, given that the map is composed of a number of image pieces:

1. The client surfs to the URL and loads the HTML page that contains the view port.

2. The HTML page sends a POST asking to load four map pieces.

3. TaskManager begins a task that retrieves the requested map pieces. The task adds the map pieces to ResultCache.

4. The HTML page executes a periodic loop that asks if the requested map pieces have been added to ResultCache. If the pieces have been added, then they are retrieved and displayed.

At this moment, there is a disconnect, as the identifier of the map pieces has not been defined. When the HTML page loads, which map pieces are loaded? Do you load the default? And if you do load the default, what exactly is the default? When loading the map pieces, you need to first determine how to identify a map piece.

Defining the Constraints of the Client

The mapping example shows what you need to do to manage large or infinite data sets, and all applications that work with such data sets share the following common attributes on the client side:

• The referencing of all data can be defined. In the case of the mapping application, this means assigning coordinates to all map pieces. You want the data to be prefixed with a reference so that the client can jump to the data using a calculated URL approach. This does not mean that the data is available at the URL, as the server side might not yet have generated that data. The references can be timestamps, coordinates, or an incremental counter, but it must be possible to determine them before the data is loaded.

• The client will directly reference the view port data and the data that will probably be viewed. The data that will probably be viewed is the look-ahead data that is preloaded. The algorithm to determine the data that is probably going to be viewed is completely dependent on the application and the user interface of the application. For example, most mapping applications have an arrow to move the map up and down by one map piece. If a user interface were to offer an arrow called “Jump 100 Units,” then the selection that is probably going to be viewed would include the immediate as well as the “Jump 100 Units” map pieces. The idea behind the data that is probably going to be viewed is to let the client or server preload the information, which makes the iteration of the data seem like one smooth process.
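Both attributes can be sketched for the mapping case, where each map piece is referenced by a (column, row) coordinate. Everything named here is hypothetical: the function names, the one-piece look-ahead margin, and the /services/map/ URL layout are assumptions for the illustration, not part of any real mapping service.

```python
def viewport_refs(col, row, width, height):
    """References (column, row) of the pieces visible in the view port."""
    return [(c, r)
            for r in range(row, row + height)
            for c in range(col, col + width)]


def look_ahead_refs(col, row, width, height, margin=1):
    """Pieces within `margin` of the view port that are not already visible."""
    visible = set(viewport_refs(col, row, width, height))
    ring = []
    for r in range(row - margin, row + height + margin):
        for c in range(col - margin, col + width + margin):
            if (c, r) not in visible:
                ring.append((c, r))
    return ring


def piece_url(ref):
    """Calculated URL for a piece; the path layout is an assumption."""
    c, r = ref
    return f"/services/map/{c}/{r}"


# A 2 x 2 view port whose top-left piece is at column 10, row 20.
visible = viewport_refs(10, 20, 2, 2)
preload = look_ahead_refs(10, 20, 2, 2)

print([piece_url(ref) for ref in visible])
print(len(preload), "look-ahead pieces to request in the background")
```

Because every reference is computed from the view port position alone, the client can build the URLs before the server has generated any of the data. A user interface with a larger jump, such as the “Jump 100 Units” arrow, would need a different look-ahead rule than the simple surrounding ring used here.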

In our case, we are building a stock ticker viewing application. The client will have the capability to navigate a list of stock tickers and then view the history of those tickers. In the snapshots, the historical data is not illustrated, because the focus is on the navigation and data manipulations. List navigation makes or breaks the functionality or performance of the client. In the case of the mapping application, users navigate by using the mouse to select and drag map pieces. In the case of the ticker example, users navigate using the arrows by moving the mouse over a ticker. Under no circumstances should you use a list box or a combo box for navigation; these elements cannot hold a large amount of data, so navigating with them can be an unpleasant experience.

Slashdot (www.slashdot.org) has managed to solve the problem of data sets being too large by using personal preference techniques. Something people like about Slashdot is the fact that they can read a story and then post their opinion on the topic, sort of like a book club for Internet geeks. Controversial Slashdot stories can garner over 300 postings, some of which are garbage but others of which are interesting.

Slashdot was the first Web site to successfully employ a technique called metamoderation. Metamoderation is when the readers are their own referees and determine whether a posting is interesting. Good postings that are funny or interesting have a higher ranking than those that are drivel. Using their personal preferences, readers can choose at what level they want to read messages; in other words, they can select to read only the most interesting postings if they like. It is a system that works. The metadata is an index for the navigation that is created on the HTML page.
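The idea of using ranking metadata as a navigation index can be sketched as follows. This is not Slashdot's code: the Posting class, the sample postings, and their scores are invented purely for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Posting:
    author: str
    score: int    # ranking assigned by readers; values here are invented
    text: str


def visible_postings(postings, threshold):
    """Keep only the postings ranked at or above the reader's chosen level."""
    return [p for p in postings if p.score >= threshold]


def threshold_index(postings):
    """Map each possible reading level to the number of postings it shows."""
    scores = [p.score for p in postings]
    return {level: sum(1 for s in scores if s >= level)
            for level in range(min(scores), max(scores) + 1)}


# Invented sample data.
postings = [
    Posting("reader_a", 5, "Insightful analysis"),
    Posting("reader_b", 1, "Me too"),
    Posting("reader_c", 3, "Funny aside"),
    Posting("reader_d", -1, "Drivel"),
]

print([p.author for p in visible_postings(postings, 3)])
print(threshold_index(postings))
```

The index is tiny compared with the postings themselves, so it can be sent with the HTML page and used to tell readers how much they would see at each level before any posting is retrieved.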
