Some Thoughts on Benchmarking, Levels of Portability and Mobility

I'm looking at the "mobile computing" side of mobile information technology. Yes, there is another side, which I haven't even touched on yet on this Web page. Having written a lot of product reviews over the years, I've been asking myself what I can say to a beginner.

The idea of "benchmark testing" is to measure something against a standard reference. It can be done to test adequacy in fulfilling a need or needs, or it can be done to establish superiority in fulfilling a need or needs.

In the earlier days of computing, merely proving adequacy was a major success. That level of testing was done with an eye on price point. The question was typically something like "can I do bookkeeping for my small company with under $5,000.00 of computer equipment?" As time went on, the size of the company asking the question might grow, or the price might drop, or other features might be looked at.

As the industry grew, the nature of the question became "which computer system under $5,000.00 can do it better?"

In the area of mobile technology, there are levels of portability and mobility (the range of travel). At the "briefcase computer" level of portability, for short range mobility, laptop computers have pretty much reached the point of "which is better" analysis. At the "long range" mobility end, warranties and service become as much of a factor as hardware. The "job" being addressed can push the products (in the sense of the total package) to the point where mere adequacy may be a singular accomplishment -- if it is reached by any.

At the "pocket" level of portability the same can be said. An individual may define an acceptable physical package and a set of capabilities, only to find that none of the products, at any price, can do the job.

The first problem faced by anyone trying to do such testing is deciding what's worth measuring. For laptops, the first standard was to measure the usual computer concerns of speed, graphics capability and storage capacity. On top of that were added the portability issues of size and weight, plus energy capacity and efficiency, which determine how long a computer runs on batteries. Within the last 3 years, serious testing has also been done on durability. But a lot of testing is still being done inadequately. The problem is that few magazines can afford to do thorough testing, and even if the money is available, the time to conduct the tests is limited in an industry where products can change in a few months.

There is no obvious solution, but a newcomer to this industry has to be warned. There is no single source of information which, taken on its own, will give you everything you need to come to a properly informed decision regarding these products. Depending on the individual, the situation may be even worse. Depending on your needs, it may be that even all the published information together does not give you what you need to know to make an informed decision.

I don't have a solution for this. I can only give you that warning, and the assurance that yes, some of us really are trying to do better.
[1996/12/11]

Human Error in Testing:

If you've read my test reports in various magazines you will know that I always try to supply relevant numbers. But I've tried to make clear that the times are usually taken by hand. So the question remains: how accurate are those times? There are a couple of common problems in these efforts. First, human reaction time is not precise, so if I time something a number of times, the results may vary. Second, there is the problem of delay and synchronization of triggering events. In most cases, I know when a process is going to start, but not when it's going to end.

Starting with repeatability, or how accurately I can control my hand, I've put together a test. All my recent testing has been conducted using the stopwatch on my "Casio Lithium 5 Alarm Chronograph" wristwatch. While this may seem like a crude technique, I have tested my ability to use specialized stopwatch devices and found that in some cases I was even less accurate. The first thing I can say, and I have often pointed this out, is that although I usually record the 1/100th second display, human error makes that digit insignificant. Still, when I'm in good condition, I actually can get groupings of readings that are accurate to within 1/10th of a second.

The test was conducted at around 22:30 on 1997/03/23. It is significant that I was a bit tired at the end of a busy day and not in the best condition, but still about as good as I tend to be when I conduct my various timing tests. The test is to try to stop the stopwatch at exactly 10 seconds, in 10 successive attempts. I did not rush to complete the set, but I did not allow myself more than a few seconds between attempts, so that the repetition allowed me to "learn" from previous attempts and readjust my timing. I tried 3 test timings to warm up. Here are the results in order:

09.92, 09.93, 09.97, 09.94, 10.08, 10.14, 09.84, 09.92, 10.01, 10.13

The numbers speak for themselves. If you want to, you can plug them into a spreadsheet and do some statistical analysis, but you can see immediately that 7 out of 10 are within 1/10th of a second of the 10 second target. I'm pleased but not surprised. I have done this sort of test before, which is why I've felt that the tests I've done have generally been worth printing.
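For anyone who would rather not use a spreadsheet, here is a minimal sketch of that analysis in Python, using only the ten readings listed above. The variable names, the 0.1 second tolerance constant and the choice of the sample standard deviation as the measure of spread are assumptions made for illustration; they are not part of the original test.

```python
import statistics

# The ten hand-timed readings from the test above, in seconds.
readings = [9.92, 9.93, 9.97, 9.94, 10.08, 10.14, 9.84, 9.92, 10.01, 10.13]

TARGET = 10.0      # the time each attempt was aiming for
TOLERANCE = 0.1    # assumed cutoff: "within 1/10th of a second"

# Signed error of each attempt relative to the target.
errors = [r - TARGET for r in readings]

mean = statistics.mean(readings)
spread = statistics.stdev(readings)          # sample standard deviation
within = sum(1 for e in errors if abs(e) <= TOLERANCE)
worst = max(errors, key=abs)                 # largest miss, sign preserved

print(f"mean reading:       {mean:.3f} s")
print(f"standard deviation: {spread:.3f} s")
print(f"within {TOLERANCE} s:       {within} of {len(readings)}")
print(f"worst miss:         {worst:+.2f} s")
```

The mean comes out very close to 10 seconds and the standard deviation is roughly a tenth of a second, which agrees with the observation that the 1/100th digit carries little meaning in hand timing.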

The other problem is that although I can start a test and start a stopwatch fairly simultaneously, watching for the end of a test is difficult. If the test is regular in nature, so that I can pretty much predict its completion, I can count down to the stop just as I did in the above example. If a test run is clearly off, I'll know it. What's more difficult is where a test has variable results, or is a one-shot test, such as timing the "download" of a file on the Internet. For those tests, I usually try to compensate at the start by beginning the timing slightly after I start the process, or I estimate the "over-run" time and deduct an appropriate amount. I've never timed the start delay compensation. I do it by "feel", knowing about how fast I expect to react.

Finally, I should mention that in my earliest benchmark tests, years ago, I used to use a Canon calculator/stopwatch. I haven't seen it around lately. It only recorded to 1/10 sec. accuracy, but the button placement, throw and general feel were so good that my timings were even more accurate than with the Casio wristwatch. I now have a new "Sportline" stopwatch which I intend to use in the future, but my preliminary tests have not shown as much accuracy. Perhaps my times will be better with more practice. I'll have to take some time to be sure. I strive to maintain a consistent level of accuracy. If I don't, it's all a waste of time.
[1997/03/24]

Testing Handwriting Recognition:

One problem I had to cope with in testing the Newton is the way in which it learns handwriting. If you write a word and leave the translated word in place, whether it was translated correctly or incorrectly, the Newton associates what you wrote with that particular word. For example, if you write the word "clear" and it translates it as "dear" and you don't change it, then the Newton assumes that the translation is correct. So the next time you write "clear" it'll translate it as "dear" again. Furthermore, that error will persist for a long time afterward because it becomes part of the translation database for a while. I don't know how long the error will last. I should ask Apple for an estimate.

So how does this affect testing? Well, if the Newton makes a mistake, should I correct it or not? If I correct every mistake, then the recognition becomes more accurate as the test progresses. That is valid because that's how it is supposed to be used, and it shows the true potential of the system. But making the corrections takes time, a lot of time, and it means my record of the results is being destroyed while the test goes on. In the tests I've done so far, I didn't make corrections during each test. Although that shows the Newton at its worst, it preserved the record of the resulting translations so I could consider the scoring method.
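No particular scoring method is settled on here, so what follows is only a minimal sketch of one possibility: count how many written words came back translated correctly, comparing the intended word list with the preserved record position by position. The function name, the case-insensitive comparison and the sample word lists are all hypothetical; only the "clear" read as "dear" error comes from the example above.

```python
def score_recognition(intended, recognized):
    """Compare two word lists position by position.

    Returns (correct, total, accuracy). This is an assumed scoring
    scheme for illustration, not a method taken from the tests.
    """
    if len(intended) != len(recognized):
        raise ValueError("word lists must be the same length")
    # A word counts as correct only if it matches exactly, ignoring case.
    correct = sum(
        1 for want, got in zip(intended, recognized)
        if want.lower() == got.lower()
    )
    total = len(intended)
    accuracy = correct / total if total else 0.0
    return correct, total, accuracy


# Hypothetical sample: what was written versus what was recorded.
intended = ["clear", "the", "table", "please"]
recognized = ["dear", "the", "table", "please"]

correct, total, accuracy = score_recognition(intended, recognized)
print(f"{correct} of {total} words correct ({accuracy:.0%})")
```

A score like this only works when the uncorrected translations are kept, which is the reason given above for not making corrections during each test.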

A perfect test of a recognition system like the Newton might have an extremely sophisticated robot arm write a set series of words, phrases and sentences with a known degree of accuracy and with known anomalies. In that way, recognition systems could be accurately measured for their ability to overcome the known problems. Financially that's out of the question.

The next best test would be to take a number of people whose handwriting is known to be representative of the range of potential buyers, have them write the same things, and videotape the handwriting and the results. Errors could be corrected during the process, simulating the proper use of the Newton, and thus fully using the "learning" capability, yet without losing the record of the results.

Unfortunately, I don't have enough money to hire the required number of people or do the studies necessary to find the "right" people.

The method left to me is simply to take as many people as I can get and test the Newton by having them write relatively typical words and phrases.

Fine. So how many people can I arrange to do this kind of test? Well, there's me. . . .
[1997/02/12]

Send messages to jimomura@pathcom.com or call (416) 652-3880.


Last update 1997/09/25.


Copyright 1996 by James Omura, Toronto, Ontario, Canada