Jun 12

Hello World! It’s been a while since I’ve written a post but I wasn’t away. I was finishing my Master of Mathematics in Computer Science at the University of Waterloo.

To finish the degree, my advisor and I wrote a research paper called “Developers Like Requirements Specifications; Project Managers Don’t and a Possibly Transcendent Hawthorne Effect”. If you are interested in reading the whole paper, you can find it here. The paper was accepted by the EmpiRE 2011 International Workshop on Empirical Requirements Engineering.

I’ll post the abstract here to give you the gist of my research:

ABSTRACT

This paper reports the results of a case study conducted in July 2010 of one industrial software development project to determine how the project’s lack of any explicit requirements gathering process affected the project’s development and the product that it produced. The study reveals that the lack of any requirements gathering process led to missing functions in the product, reduced productivity among the project’s members, and poor cost estimation. This lack converted a potentially profitable project into a liability. In the end, the project members completed the product, but much time was wasted. A requirements specification could have saved this time.

Conducting the case study resulted in an increased awareness among the study’s subjects, i.e., the project’s management and members, that a requirements engineering process was needed. This awareness led to a Hawthorne effect, in which the project management and members improved their requirements process. The next project conducted by the project management was begun with an explicit requirements gathering process. This improved process continued through at least May 2011, 11 months after completion of the study.

Aug 23

I wrote this article for my course called: “Advanced Topics in Computer Science: Health Informatics”.

Introduction

Accurate knowledge about patients and diseases is critical when clinical decisions are made. According to [8], improving medical knowledge depends upon the ability to analyze practice outcomes and apply them to patients. However, to analyze these outcomes, we need data that is comparable, and to have comparable data, all the parties involved need to understand the same vocabulary. As a result, the single greatest obstacle to comparable data remains the lack of common clinical vocabularies. After all, the data we store one day might be difficult to interpret the next day if the vocabulary used to encode it has changed. Common clinical vocabularies must be more than a list of terms. They need to support synonymy, multiple classifications, and domain completeness, and they need to provide consistent views of the definitions while being unambiguous and avoiding redundancy.

In this article we examine different aspects of common clinical vocabularies. This article will:

List a set of requirements every organization should take into consideration when creating a common clinical vocabulary.

Discuss what common clinical vocabularies are for, as well as why it is so hard to create one.

Provide a list of vocabularies that are currently used.

Discuss how change is handled in common clinical vocabularies.

Definition

Common Clinical Vocabularies are the natural prerequisite for disease and health outcome studies. Also, they are standardized terms and their synonyms, which record patient findings, circumstances, events and interventions with sufficient detail to support clinical care, decision support, outcomes research and quality improvement [8].

What is it for?

According to [6], despite a vast literature on common clinical vocabularies, there is little information on what tasks they need to perform. However, [6] identifies some tasks that vocabularies need to facilitate, such as:

Collect information: on individual patients, population of patients, and institutions.

Present information: querying and retrieving information about patients.

Navigate and browse information: either on the web or in a local repository.

Index knowledge: either medical knowledge or information about patients.

Analyze and generate natural language: which can be used internationally, based on local usage and preferences.

Why Common Clinical Vocabularies?
There are multiple reasons why common clinical vocabularies are needed. First of all, creating one is a challenge, and not just any challenge: it has been considered one of the grand challenges for medical informatics. Most importantly, a common terminology is necessary to share data universally.

Computers play a big role, since they have changed the direction of medicine. Nonetheless, they complicate matters, since patients can educate themselves using the internet. This can result in patients reading inadequate information about the proper medical action their disease requires. On top of that, there is the English language: a complex language that looks even more complicated when we add the ambiguity and redundancy of the clinical terms used by doctors.

According to [8], the Institute of Medicine (IOM) conducted a study indicating that between 44,000 and 180,000 Americans die each year as a result of medical errors. In another survey, conducted at the 2000 Healthcare Information and Management Systems Society (HIMSS) conference, 98% of respondents believed that common clinical vocabularies would be important in reducing medical errors.

Common clinical vocabularies attempt to eliminate these problems by removing semantic issues between doctors, nurses, researchers, patients, and the general public.

Why is it so hard?
The difficulties start with the fact that humans and computers understand information in different ways. In this regard, Donald Norman said “We are analog beings trapped in a digital world… We are compliant, flexible, and tolerant. Yet we have constructed a world of machines that requires us to be rigid, fixed, and intolerant” [6]. Nowadays, some of the existing technology for information exchange is designed for computers. As a result, users have to adapt to computers. The desired result is a vocabulary that is understandable to health care professionals and, at the same time, to the software engineers who work on health care systems.

Creating a common clinical vocabulary requires consensus. Consensus is not always possible, since doctors, nurses, and other health care professionals disagree. One way to minimize the difficulty of achieving consensus is to establish the areas in which a level of consensus is needed and the areas that can be left to the local choice of the health care professional.

Clinical vocabularies introduce a dilemma between the interpretation of terms by patients and by doctors. If the ultimate aim is diagnosis by computer, it is mandatory to have a totally unambiguous clinical vocabulary. How hard is this to achieve? Charles Murray [4] conducted a survey to evaluate how doctors and patients interpret clinical vocabularies. In his study, multiple-choice questionnaires were completed by 234 patients and compared with those completed by 35 doctors. In the study, the doctors reached a level of agreement of over 90%. However, the patients did not reach complete agreement on the definition of any term. For example, for the word Diarrhea, 54% of the patients thought the term means “passing a lot of bowel motions in a short time”, while 68% of the doctors answered that the word means “Passing loose bowel motions” [4]. This example shows how hard it is to come up with unambiguous semantics, even for the most common terms.

Requirements
There are several requirements every common clinical vocabulary needs to meet. Based on [8], we identified four basic requirements:

Evolving: the vocabulary needs to be expandable. It must be able to grow as new terms are created, existing concepts are refined, or some concepts are retired. Also, the vocabulary must carefully track changes and report any violations incurred when a term is modified.

Unique: each term should have a single conceptual meaning. Terms cannot be vague or redundant. If a term in a common clinical vocabulary is discovered to have two or more meanings an appropriate response is to disambiguate these meanings by creating a separate term for each [2].

Unchangeable: once a term is defined, it should be permanent and immutable. If the concept is made inactive, the term still needs to retain its uniqueness and remain in the structure. Usually, terms that are deleted create a problem for systems that are using them [2]. For example, if a patient receives a diagnosis on a specific date, it is not acceptable to delete the diagnosis, only because the term was removed.

Hierarchical: a concept and its terms should be related to each other in the form of a hierarchy, based on the concept’s essential meaning. Individual terms can, however, be represented in multiple hierarchies as long as they remain unique.
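
The four requirements above can be illustrated with a small data structure. What follows is only a minimal sketch in Python, not taken from any of the cited vocabularies: the class names `Concept` and `Vocabulary` and the codes `C1` to `C4` are hypothetical. Concepts are frozen (unchangeable), every name or synonym may point to only one concept (unique), new concepts can be added at any time (evolving), and a concept may have several parents (hierarchical, with multiple hierarchies).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """One concept with a single meaning; frozen so it cannot be edited in place."""
    code: str
    name: str
    synonyms: tuple = ()
    parents: tuple = ()  # codes of broader concepts; more than one is allowed


class Vocabulary:
    """Hypothetical in-memory vocabulary used only for illustration."""

    def __init__(self):
        self._concepts = {}  # code -> Concept
        self._labels = {}    # lower-cased name or synonym -> code

    def add(self, concept):
        if concept.code in self._concepts:
            raise ValueError(f"code {concept.code} is already defined")
        for parent in concept.parents:
            if parent not in self._concepts:
                raise ValueError(f"unknown parent {parent}")
        labels = [concept.name, *concept.synonyms]
        for label in labels:
            # Uniqueness: a label may refer to one concept only.
            if label.lower() in self._labels:
                raise ValueError(f"'{label}' already names another concept")
        self._concepts[concept.code] = concept
        for label in labels:
            self._labels[label.lower()] = concept.code

    def lookup(self, label):
        """Resolve a name or synonym to its concept."""
        return self._concepts[self._labels[label.lower()]]

    def ancestors(self, code):
        """Collect every broader concept reachable through any hierarchy."""
        seen = set()
        stack = list(self._concepts[code].parents)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._concepts[current].parents)
        return seen


vocab = Vocabulary()
vocab.add(Concept("C1", "Disease"))
vocab.add(Concept("C2", "Infectious disease", parents=("C1",)))
vocab.add(Concept("C3", "Lung disease", parents=("C1",)))
vocab.add(Concept("C4", "Pneumonia", synonyms=("Lung infection",), parents=("C2", "C3")))
print(vocab.lookup("lung infection").code)
print(sorted(vocab.ancestors("C4")))
```

In this sketch, trying to add a second concept named “Pneumonia” raises an error instead of silently creating an ambiguous term.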

Impact on Health Care Organizations
Common clinical vocabularies can impact organizations in many ways. For example, common clinical vocabularies can create a link between industry-wide and organization-specific vocabularies [1]. Furthermore, they facilitate interoperability, because organizations can exchange comparable data with each other.

Integrating global and specific vocabularies allows Electronic Health Records (EHR) to be cross-referenced to standards that everyone can understand. Consequently, a health care organization could save time, money, and resources [1]. Finally, a common vocabulary reduces the opportunities for misinterpreted, inaccurate, or imprecise data and for human error in a patient’s record. As a result, quality in the organization increases.

Vocabularies in Use
The following section discusses different vocabularies and classification standards.

International Classification of Disease (ICD)
The ICD is a set of classifications where one code typically represents a category to which several diseases may be mapped [9]. The classification has its origins in the 1850s [11]. As of July 2010, the latest version is ICD-10 [11].

ICD has gained wide acceptance for coding clinical disorders, especially for hospital billing purposes [3]. The ICD is used internationally as a standard diagnosis classification for general epidemiological, health management, and clinical purposes [11]. Additionally, it includes terms for medical and surgical procedures, occupations, and other factors influencing a patient’s health status. The basic structure of ICD is a strict hierarchy.

Nonetheless, the ICD has several shortcomings: many categories are too broad to be clinically useful, a significant amount of detail is lost when a paper-based record is coded, and it contains many ambiguous and redundant catch-all categories.

Systematized Nomenclature of Medicine (SNOMED)
In comparison with codings or classifications, SNOMED covers the breadth and depth of health care terminology. Several investigations confirmed SNOMED as a source with one of the best overall coverage of clinical content. It uses explicit hierarchies, description logic concept definitions, and relationships.

Unified Medical Language System (UMLS)
According to [10], UMLS facilitates the development of computer systems that operate as if the system knows the meaning of the language of health and biomedicine.

The UMLS Knowledge sources (databases) are distributed by The National Library of Medicine (NLM) in the United States. UMLS Knowledge sources are created for developers and not for end-users. Additionally, the NLM distributes software tools that can be used by software developers to create, process, retrieve, and integrate health data [10].

There are three UMLS Knowledge sources: the Metathesaurus, the Semantic Network, and the Specialist Lexicon. The Metathesaurus is a very large multi-lingual vocabulary database that contains information about biomedical and health-related concepts [10]. The Semantic Network is a set of broad categories that provide a categorization of all the concepts represented in the UMLS Metathesaurus. The Specialist Lexicon is under development by the NLM to provide a general English lexicon that includes many biomedical terms.

Obtaining the Knowledge Sources or any software tool distributed by the NLM is free of charge and accessible over the Internet for any user. However, the use of the Metathesaurus requires a license agreement.

RadLex
RadLex is an initiative from the Radiological Society of North America (RSNA). It provides a uniform structure for capturing, indexing, and retrieving a variety of radiology information sources (e.g. radiology reports). Rather than “re-inventing the wheel”, RadLex unifies and supplements other lexicons and standards, such as SNOMED and UMLS. RadLex is very beneficial for educators, clinical radiologists, and radiology researchers.

RadLex terms are organized into categories which provide an overall organization for the lexicon and are a guide for how imaging information can be used. Some examples include: treatment, uncertainty, image quality, and others.

To illustrate how RadLex can benefit radiology educators or researchers, a case study written by [5] is presented in the next sub-section.

A RadLex Case Study: Clinical Decision Support.
A radiologist is interpreting a chest CT showing a tree-in-bud appearance. The radiologist is unsure whether the examination being interpreted truly exhibits this feature, and does not know the diagnostic possibilities that might explain the appearance.

Before RadLex, the radiologist consults textbooks, journal articles, and online sources. However, he spends a lot of time searching different databases and looking through different results.

With RadLex, the radiologist is able to search for tree-in-bud on the RadLex site. He finds an image that matches the case at hand. A diagnosis of tree-in-bud is displayed, including links to relevant full-text articles from journal websites.

In this case study, RadLex is able to satisfy the needs of a radiologist. Nonetheless, if needed, RadLex could also satisfy the needs of software developers and systems vendors.

Digital Imaging and Communications in Medicine (DICOM)

DICOM defines a method of communication for medical image systems. It is being developed by the National Electrical Manufacturers Association (NEMA) and the American College of Radiology (ACR). To facilitate interoperability, it specifies a communication protocol and the semantics of commands, but it does not provide any implementation details.

The goals of DICOM include: obtaining images and all of the information associated with a patient, achieving compatibility, and improving workflow efficiency between imaging systems and other information systems in health care environments worldwide.

Why DICOM?
The first reason is that it provides a single identification of images. A radiology department produces thousands of images per day. If images are stored in a JPEG or GIF format, they can lose their demographic data. In contrast, DICOM associates information (such as the name of the patient, type of examination, hospital, date of examination, type of acquisition, etc.) with each image produced. Thus each image is autonomous. If an image is lost, it is always possible to formally identify its origin, the patient, the date, etc.

Each image has four unique identifiers: the service-object pair class, the study authority, the series authority, and the image UID. The service-object pair class identifies the type of service for which the image is intended. The study authority identifies a whole examination, in time and place. The series authority identifies a series of images within the examination. Finally, the image UID identifies the image associated with the file.
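
Because the study, series, and image identifiers nest inside each other, a set of images can be regrouped into examinations and series from the identifiers alone. The following Python sketch shows the idea; the `ImageIds` type, its field names, and the identifier values are hypothetical placeholders and are not real DICOM UIDs or attribute names.

```python
from collections import defaultdict
from typing import NamedTuple


class ImageIds(NamedTuple):
    """The four identifiers described above (field names are illustrative)."""
    sop_class: str  # type of service the image is intended for
    study: str      # identifies the whole examination
    series: str     # identifies a series within the examination
    image: str      # identifies the individual image


def group_by_study_and_series(images):
    """Rebuild the study -> series -> images tree from per-image identifiers."""
    tree = defaultdict(lambda: defaultdict(list))
    for ids in images:
        tree[ids.study][ids.series].append(ids.image)
    # Convert to plain dictionaries so the result is easy to inspect.
    return {study: dict(series) for study, series in tree.items()}


images = [
    ImageIds("ct-storage", "study-1", "series-a", "image-001"),
    ImageIds("ct-storage", "study-1", "series-a", "image-002"),
    ImageIds("ct-storage", "study-1", "series-b", "image-003"),
    ImageIds("mr-storage", "study-2", "series-c", "image-004"),
]
print(group_by_study_and_series(images))
```

Since every image carries all of its identifiers, an image that is separated from the rest can still be placed back into the right examination and series.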

The second reason is that it uses a common vocabulary. DICOM uses SNOMED to universally identify the data from machine to machine.

The third reason is that the format is used by different medical specialties. DICOM is used in radiology, cardiology, radiotherapy, and many others.

DICOM File Format
The DICOM file format is composed of a header and the image data [7]. The header stores information such as the patient’s name, the type of scan, and the image dimensions. The image data can contain information in three dimensions. Also, it can be compressed to reduce the image size.

In the example given in [7], the first 794 bytes are used for a DICOM format header. These bytes describe the image dimensions and retain other text information about the scan. The image data follows the header. DICOM requires a 128-byte preamble followed by the letters ‘D’, ‘I’, ‘C’, ‘M’. This is followed by the header information, which is organized in groups [7]. Some DICOM elements are required, depending on the image type. If a required element is missing, the file violates the DICOM standard.
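
The preamble rule described above is easy to check in code. This is a minimal Python sketch, not a DICOM parser: the function name `has_dicom_prefix` is made up for this example, and it only tests for the 128-byte preamble followed by the letters DICM, without reading any header groups.

```python
import io

PREAMBLE_LENGTH = 128  # bytes before the marker
MAGIC = b"DICM"        # the letters 'D', 'I', 'C', 'M'


def has_dicom_prefix(stream):
    """Return True if the stream starts with a 128-byte preamble followed by b'DICM'."""
    head = stream.read(PREAMBLE_LENGTH + len(MAGIC))
    if len(head) < PREAMBLE_LENGTH + len(MAGIC):
        return False  # too short to contain the preamble and marker
    return head[PREAMBLE_LENGTH:] == MAGIC


# An in-memory stand-in for a file: empty preamble, marker, then arbitrary bytes.
fake_file = io.BytesIO(bytes(128) + b"DICM" + b"header groups and image data")
print(has_dicom_prefix(fake_file))  # True
```

With a real file, the same function can be called on a file opened in binary mode, for example `open(path, "rb")`, where `path` is whatever file you want to test.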

How to handle Change in Common Clinical Vocabularies
Clinical vocabularies and medical knowledge will grow. Evolution is necessary and inevitable. Changes in common clinical vocabularies have several advantages and disadvantages.

Some advantages include: addition, refinement, removing redundancies, and updating obsolete terms. Addition is required by the evolution of the discipline of medicine. Refinement occurs when one or more terms are added to a vocabulary to specify a greater level of detail. Redundancy is removed when a code or term that is identical in meaning to an existing term is eliminated. Finally, new knowledge often requires the addition of new terms to a vocabulary and, as a result, renders some existing terms obsolete. Even though a term has fallen out of favor, it cannot be removed from the vocabulary, because a patient could have been diagnosed with that term. Instead, new terms can be added as refinements of the obsolete terms.

Some disadvantages include: major name changes and changed codes. A major name change is one in which the new name corresponds to a true change in the term’s meaning. There are two scenarios when dealing with major name changes: deletion and addition.

In the deletion case, terms may be deleted if the creators no longer wish to include the concept in the domain of the terminology. For example, if a patient was diagnosed with a disease on a particular date, it would be unacceptable to simply delete the diagnosis because the disease term is no longer part of the vocabulary. However, in most cases, no changes are needed. For example, if the laboratory stops performing a particular test, the existence of the term in the clinical vocabulary is harmless. Any previous occurrences of the test remain coded in the patient databases and remain interpretable.
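
One way to picture this is to mark a term as retired instead of deleting it. The Python sketch below is only an illustration under that assumption: `TermRegistry`, the code `LAB-001`, and the patient record are hypothetical. Old records can still be interpreted, while new records can no longer use the retired term.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TermStatus:
    name: str
    retired_on: Optional[date] = None  # None means the term is still active


class TermRegistry:
    """Hypothetical registry that retires terms rather than deleting them."""

    def __init__(self):
        self._terms = {}

    def add(self, code, name):
        if code in self._terms:
            raise ValueError(f"code {code} is already defined")
        self._terms[code] = TermStatus(name)

    def retire(self, code, on):
        # The entry stays in the registry; only its status changes.
        self._terms[code].retired_on = on

    def name_of(self, code):
        """Interpret a code found in an old record, whether retired or not."""
        return self._terms[code].name

    def usable_on(self, code, day):
        """A retired term may not be used to code records from its retirement date on."""
        retired_on = self._terms[code].retired_on
        return retired_on is None or day < retired_on


registry = TermRegistry()
registry.add("LAB-001", "Hypothetical lab test")
record = {"patient": "P-42", "code": "LAB-001", "date": date(2009, 3, 1)}

registry.retire("LAB-001", on=date(2010, 1, 1))
print(registry.name_of(record["code"]))                 # Hypothetical lab test
print(registry.usable_on("LAB-001", date(2010, 6, 1)))  # False
```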

In the addition scenario, when the new term represents a truly new concept, the proper response is simply to accept it into the vocabulary and use it when appropriate.

There are different ways to deal with change in clinical vocabularies. One way is to apply automated vocabulary maintenance methods. However, the right method can only be applied when the type of change is well understood. At present, no method can automatically detect the type of change needed for specific scenarios. For example, no method can differentiate between a minor and a major name change. Vocabulary changes usually do not include information regarding the reason for the change. Such information in a structured, machine-readable format might help. Nonetheless, the most efficient way to deal with change is to have domain experts perform manual reviews of the required changes.

Conclusion

After reviewing the literature for common clinical vocabularies, we can point out many lessons learned such as:

Common clinical vocabularies are an essential piece in the process of automating and computerizing health care.

Clinical vocabularies can improve quality and reduce errors in IT systems.

The ideal characteristics of a common clinical vocabulary include: concepts with one meaning, structured and controlled, and a sense of evolvability.

Patients will try to educate themselves on clinical terms using the internet. Achieving a consensus on clinical terms is necessary to avoid confusion between patients and doctors.

Until new methods are discovered, manual reviews by domain experts are the best way to deal with change in common clinical vocabularies.

SNOMED is the closest to a well-established common clinical vocabulary.

The potential of common clinical vocabularies will depend on their ability to have an impact on medicine and technology. That will only happen when common clinical vocabularies are used and re-used in software, so that independently developed medical records and decision support systems share the same information using the same terminology. If common clinical vocabularies have their way, they will come into routine use by all the parties involved in health care.

References

[1] 3M Health Information Systems. “Using a Medical Data Dictionary to Comply with Vocabulary Standards and Exchange Clinical Data”. Retrieved on June 2010.

[2] Cimino, J; and Clayton, PD. “Coping with changing controlled vocabularies”. In Eighteenth Annual Symposium on Computer Applications in Medical Care. 1994. Washington, DC: Hanley & Belfus, Inc, Philadelphia PA: pp. 135-139.

[3] Cimino, J; and Johnson, Stephen. “Designing an Introspective, Multipurpose, Controlled Medical Vocabulary”. In Proc. 13th Annual Symposium on Computer Applications in Medical Care. L. C. Kingsland (ed.), IEEE Computer Society Press, November 1989, 513-518.

[4] Murray, Charles. “Difference between Patient’s and Doctor’s interpretation of some common medical terms”. British Medical Journal. 1970.

[5] Radiological Society of North America. “RadLex: Overview and Examples”. Retrieved on June 2010.

[6] Rector, Alan. “Clinical Terminology: Why is it so hard?”. Methods of Information in Medicine 38(4):239-252. 1999.

[7] Rorden, Christopher. “The DICOM Standard”. Georgia State University. Retrieved on June 2010.

[8] Rose, Jeffrey; Hogan, William; Marshal, Philip; and Kirkley, Debra. “Common Medical Terminology Comes of Age, Part One: Standard Language Improves Healthcare Quality”. Journal of Healthcare Information Management. 2001.

[9] Rose, Jeffrey; Hogan, William; Marshal, Philip; and Kirkley, Debra. “Common Medical Terminology Comes of Age, Part Two: Current Code and Terminology Sets Strengths and Weaknesses”. Journal of Healthcare Information Management. 2001.

[10] United States National Library of Medicine. “Unified Medical Language System Fact Sheet”. Retrieved on July 2010.

[11] World Health Organization. “International Classification of Diseases”. Retrieved on July 2010.

May 3

I wrote this survey for my course called: “Advanced Topics in Data Bases: Cloud Data Management”

ABSTRACT
In this survey, we examine a challenge for cloud service providers: designing privacy into their cloud computing applications. The survey describes different privacy risks and threats that must be taken into consideration when designing privacy-aware cloud applications, including: data loss, legal liabilities, unauthorized access to data, and more. Additionally, we identify privacy requirements for privacy-aware cloud applications, such as: data quality, accountability, openness and transparency, and many others. For software engineers, architects and designers, we provide different guidelines for designing privacy-aware cloud applications which include: recommended practices, tradeoffs of privacy-aware designs, and technologies that are useful for the design stage. Furthermore, we present six privacy designs that present different solutions to different privacy issues. To conclude, we discuss other issues related to designing privacy-aware applications and present our conclusions and opportunities for future research.

1. INTRODUCTION
For users or organizations that do not have the resources or purchasing power to store and manage large amounts of data by themselves, cloud computing is a tempting solution. However, as users store their information in data centers that they do not operate, privacy becomes an issue. Privacy is the right to protect sensitive data and personal information from unintentional and intentional attacks and disclosure [7].
According to [13], in a survey of European citizens regarding perceptions of privacy, two-thirds of the participants expressed concerns that organizations holding their personal information would not handle it properly. Furthermore, the survey also showed that eighty percent of the citizens interviewed feared data leakage. These results demonstrate the need for privacy-aware applications that can combat current and future threats.
In most cases, privacy is not a primary design goal in software development [8]. Consequently, many companies insert privacy as an add-on, which fails to provide enough privacy guarantees.
In this survey we examine guidelines and requirements for designing privacy-aware cloud applications, as well as looking at the latest design mechanisms used to solve different privacy issues on cloud applications. This survey also discusses the different privacy threats and risks that privacy-aware cloud applications are exposed to. This survey will:

  • provide an evaluation of privacy risks and threats in cloud computing;
  • list a set of privacy requirements software engineers should take into consideration when designing privacy-aware cloud applications;
  • provide a detailed review of privacy designs targeted to privacy-aware cloud applications;
  • provide an enumeration of design guidelines for software engineers who design privacy into their cloud applications;
  • and promote a discussion of future research directions.

The rest of this article is organized as follows: the background and the literature selection process are introduced in section 2; privacy risks and threats are reviewed in section 3; privacy requirements and guidelines for designing privacy-aware cloud applications are discussed in sections 4 and 5, respectively; privacy designs for different privacy issues are explained in section 6; some interesting privacy related issues are discussed in section 7, and this survey presents its conclusions in section 8.

2. BACKGROUND AND LITERATURE SELECTION
Prior to reviewing different designs, requirements and guidelines concerning privacy-aware cloud applications, we must define the problem and note some of the literature that has already been written on the topic.

2.1 Problem Definition
Privacy is one of the main concerns users or companies have about cloud computing. Protecting the privacy of a user is extremely important. Users are frustrated with data systems that do not define the behaviors that impact their privacy. Ignoring this frustration has led to an erosion of trust, negative press and even lawsuits [11].
Privacy aware data systems need to protect different types of information such as [14]:

  • Personally Identifiable Information (PII): information that could be used to identify an individual. PII could be a name, address, phone number, fax number, email address, and others.
  • Sensitive information: information that must be specially protected because it could cause serious harm to a user. For example, information that could be used to discriminate against an individual based on religion, race, ethnic background, or political opinions. Furthermore, information that could facilitate identity theft or permit access to a user’s account (passwords or PINs) is considered sensitive information [11].
  • Usage data: refers to data collected from devices such as printers, visited websites or the usage history of a product.
  • Unique device entities: information that might help to trace a user device such as: IP addresses or unique hardware identities.

Nowadays, cloud computing service providers have a very specific problem: design for privacy in order to decrease privacy risks and threats.

2.2 Literature Selection
The research was conducted in two electronic databases: the ACM Portal, and Google Scholar. These databases were chosen because of the broad range of computing disciplines they cover.
In this survey, the goal is to review research findings, and present a survey for designing privacy-aware cloud applications. The databases returned thousands of articles related to privacy, which were narrowed by reading the abstracts and looking for articles with keywords such as: privacy design, privacy guidelines, and others. Additionally, the list of references was evaluated to identify titles that met our search criteria. The result of this selection process allowed us to find different authors that investigated different aspects of designing privacy-aware cloud applications.
Pearson (2009) focused on the privacy challenges that software engineers face in a cloud computing environment. Additionally, Pearson suggested design principles that should be taken into consideration when software engineers work on cloud applications.
In 2007, Microsoft recognized that there were no industry-wide practices to protect a customer’s privacy. Consequently, they proposed a set of guidelines for respecting customer privacy, data integrity, and improving the level of trust between the industry and customers.
Gu and Cheung (2009) researched the development and testing of privacy-aware systems in a cloud environment. They argued that methodologies for designing and testing a system are essential and must be established for every cloud application.
Mowbray and Pearson (2009) developed a client based-privacy manager which helps to reduce the risk of storing sensitive information in a cloud.
Nyre and Jaatun (2009) proposed a way of analyzing policy enforcements made by cloud service providers by calculating the probability they will follow privacy policies.
Wang et al (2009) proposed a privacy-preserving public auditing system that is able to audit data without requiring a copy of the data.
Casassa et al (2003) worked on a privacy model to address two privacy issues: letting users control their personal information, and making cloud service providers accountable for their behavior while they deal with users’ personal information.
Creese et al (2009) addressed whether there are opportunities to design data protection in the early stages of software development.

3. PRIVACY RISKS AND THREATS
In this section we discuss privacy threats and risks that software engineers must be aware of when designing cloud computing applications.

3.1 Privacy risks on cloud computing
In cloud computing applications, data is stored in a platform that is shared by multiple users and organizations. As a result, many risks arise from the fact that confidential information is stored outside of the boundaries of a user or an organization.
Pearson [14] correctly identifies the risks for parties involved in a cloud computing application such as: users, companies and cloud computing service providers.
Users of cloud applications face risks such as being obligated or persuaded to give personal information against their wishes. Furthermore, once gathered, their financial details and health data can be exploited against them.
Companies using cloud applications face risks such as data loss or leakage. As a result, companies using the services of a cloud provider are exposed to a loss of reputation and credibility.
Nonetheless, cloud computing service providers are the ones that face the greatest risks, such as legal liability, loss of reputation and credibility, and lack of user trust.
Charlesworth and Pearson identified two privacy risks users are exposed to when using cloud computing applications: outsourcing, and offshoring.
Outsourcing of data processing raises governance and accountability questions [3]. For example, which party is responsible for ensuring legal requirements are observed, or that data is handled properly? To what degree can data processing be outsourced? How can users verify the identities of subcontractors?
It is very likely that cloud service providers that outsource their data processing to third parties will have weak trust relationships with their users. Furthermore, it will be hard to verify that operations such as data deletion have actually been carried out.
Offshoring data processing increases risk factors and legal complexity [3]. Questions of jurisdiction become relevant, including: in which country can a trial be conducted? Whose law applies?
A cloud service provider that combines outsourcing and offshoring may raise very complex issues [3].

3.2 Privacy threats on cloud computing
Privacy threats vary by scenario. Cloud applications face low threats if the information will, at some point, become available to the public. On the other hand, services that are customized (based on location, user preferences, and others) face higher threats.
According to [14] there are several main threats software engineers should be aware of:

  • Personal information about a user could be used, stored or propagated in a way that is not acceptable according to a user’s agreement.
  • People outside of the cloud could get inappropriate or unauthorized access to personal data. This could happen by taking advantage of security holes or exposed data. Businesses that store sales data on cloud applications face the threat that their data could be sold to competitors, exposing confidential information about their business model.
  • Legal non-compliance. For example, restrictions on transborder data flow may apply, and some data may be subject to additional regulations.


4. PRIVACY REQUIREMENTS

The Fair Information Practice Principles (FIPPs) is an effort from the United States Federal Trade Commission to establish privacy policies for online entities that collect personal information. The FIPPs are widely accepted by many foreign nations and international organizations. These principles can be applied to cloud computing, and provide a good foundation to establish minimal privacy requirements that every cloud computing application should provide. The FIPPs are:

  • Accountability: organizations managing personal information should be accountable for taking steps to ensure that privacy practices and policies are followed. In addition, organizations need to audit their adherence to privacy principles, and monitor the controls used to manage privacy [5]. More details can be found in section 7.2.
  • Security safeguards: cloud computing service providers should be responsible for protecting personal information from loss, destruction, unauthorized use, and modification through all phases of the software life cycle.
  • Purpose specification: personal information should be limited to the purpose for which it was collected. The purpose of the collection should be specified, and if any changes occur, those changes need to be publicized as well.
  • Use Limitation: data should not be disclosed or used for anything other than a specified purpose without the consent of the user. Data should only be kept as long as is needed. To accomplish use limitation our design needs to create roles to define who can access the information, audit access and use, manage access based on authorization, and log access requests [5].
  • Openness and transparency: cloud computing service providers need to inform users what information they want to gather, how the information is going to be used, and with whom it will be shared, and respond to any other inquiries. Users should have the means to learn about how their personal information will be used.
  • Individual participation: users should be able to access the information, request modifications, and challenge the cloud provider’s privacy policies.
  • Data quality: users should be able to check the accuracy and completeness of their current personal information. Cloud providers have to guarantee the accuracy of the information held.
  • Choice and consent: users must be given a choice whether they want to share certain information or not. Cloud service providers need to create methods to obtain consent from users and document the history of given and denied consents [5].
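
As one possible reading of the use limitation principle, the sketch below combines role-based authorization with a log of every access request. The role names, the purposes, and the `AccessController` class are hypothetical and serve only as an illustration; they do not come from [5].

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping from roles to the purposes they may access data for.
ROLE_PURPOSES = {
    "billing_clerk": {"billing"},
    "support_agent": {"support"},
}

@dataclass
class AccessController:
    """Grants access by role and purpose, and logs every request."""
    log: list = field(default_factory=list)

    def request(self, role: str, purpose: str, record_id: str) -> bool:
        allowed = purpose in ROLE_PURPOSES.get(role, set())
        # Log granted and denied requests alike so that access can be audited.
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "role": role,
            "purpose": purpose,
            "record": record_id,
            "allowed": allowed,
        })
        return allowed

controller = AccessController()
granted = controller.request("billing_clerk", "billing", "user-42")
denied = controller.request("support_agent", "marketing", "user-42")
```

Logging denied requests as well as granted ones is a design choice: an auditor can then see attempted misuse, not only successful access.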

According to [12], privacy legislation varies according to the country. Furthermore, privacy laws can have different views. For example, in the European Union, privacy is a basic right, but in the Asia Pacific region, privacy legislation is focused on avoiding harm. As a result, depending on the country, legislation could impose requirements such as: agreeing to rules regarding data retention and disposal, data access, and more.

5. GUIDELINES FOR DESIGN
This section provides guidelines for software engineers who design and develop privacy-aware cloud applications. It is unrealistic to expect that developers will be trained on privacy standards, but they do have a responsibility to follow a minimum set of development practices to reduce privacy flaws.

5.1 Recommended practices
According to [14] there are six recommendations software engineers, system designers, developers, and architects should take into consideration when designing cloud computing applications.

5.1.1 Minimize personal information sent to and stored in the cloud
The best way to protect a customer’s privacy is to not store his data [11]. However, data needs to be stored. As a result, cloud designers can benefit from analyzing the minimal amount of information required from a customer in order for a cloud to operate. Cloud applications need to store only data which is planned to be used immediately, and is absolutely necessary to achieve a determined business purpose. When data is no longer needed it needs to be deleted [11].
Data storage mechanisms can be reduced if there is less information to store in a cloud. Nonetheless, when personal information is sent to the cloud it can be protected in the dataset by using encryption or data mining techniques.
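
A minimal sketch of these two ideas, sending only the required fields and deleting data once it is no longer needed, might look as follows. The field names, the 90-day retention period, and both helper functions are assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical allow-list: the only fields the service needs for its purpose.
REQUIRED_FIELDS = {"email", "shipping_address"}
# Hypothetical retention period after which records are deleted.
RETENTION = timedelta(days=90)

def minimize(profile: dict) -> dict:
    """Drop every field that is not strictly required before upload."""
    return {k: v for k, v in profile.items() if k in REQUIRED_FIELDS}

def purge_expired(records: list, today: date) -> list:
    """Keep only records that are still within the retention period."""
    return [r for r in records if today - r["stored_on"] <= RETENTION]

profile = {
    "email": "alice@example.com",
    "shipping_address": "1 Main St",
    "date_of_birth": "1980-01-01",
    "health_notes": "n/a",
}
uploaded = minimize(profile)

records = [
    {"id": 1, "stored_on": date(2011, 1, 1)},
    {"id": 2, "stored_on": date(2011, 5, 1)},
]
kept = purge_expired(records, today=date(2011, 6, 1))
```

Here only the e-mail and shipping address would leave the client, and only the record stored in May survives the purge on 1 June.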

5.1.2 Protect personal information in the cloud
Personal information has to be protected from any loss or theft. Employees or independent companies that access a user’s personal information need to have a business purpose for accessing the data. Additionally, employees or third parties should only be given access to information they need to fulfill their business purpose. To ensure this, security safeguards can be used in order to prevent unauthorized access, copying, or modification of personal information.

5.1.3 Maximize user control
Users or companies must be able to control the data that is stored about them. Lack of control generates distrust, while giving users control over their information generates trust. There are several ways to give users control of their information. For example, users should be able to access a user interface to modify their personal information on the cloud at any time. Also, users could choose a third party company to audit the way their information is being managed on a cloud. In order to respond to these requests, it is important to design a system that is able to show how data for a specific user is being stored and disclosed.

5.1.4 Allow user choice
Users must be given a choice about whether to share their information, and their consent must be obtained. To accomplish this, designers can create opt-in and opt-out mechanisms that let users decide whether to share their information. However, legal requirements for opt-in and opt-out mechanisms can vary among the places where a design may be used. It is preferable to adopt strict requirements, which can satisfy most of the places where a design might be deployed.
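
One way to sketch such a mechanism is a consent registry that keeps the full history of decisions and treats the absence of a decision as a refusal. Defaulting to opt-in is an assumption made here as the stricter choice; the `ConsentRegistry` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ConsentRegistry:
    """Records every consent decision; sharing is off unless the user opted in."""
    history: list = field(default_factory=list)

    def record(self, user: str, purpose: str, granted: bool) -> None:
        # Append-only, so the history of given and denied consents is kept.
        self.history.append((user, purpose, granted))

    def may_share(self, user: str, purpose: str) -> bool:
        # The most recent decision wins; no decision at all means "no".
        for u, p, granted in reversed(self.history):
            if (u, p) == (user, purpose):
                return granted
        return False

registry = ConsentRegistry()
before = registry.may_share("alice", "newsletter")   # no decision yet
registry.record("alice", "newsletter", True)         # opt in
after_opt_in = registry.may_share("alice", "newsletter")
registry.record("alice", "newsletter", False)        # opt out
after_opt_out = registry.may_share("alice", "newsletter")
```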

5.1.5 Specify and limit the purpose of data usage
When the information is loaded into the cloud, it must be limited to the preferences and conditions set by a user or organization. Data usage has to be restricted only to the user’s specified purpose. A cloud application design should always validate the data usage against the allowed usage intentions.
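
The validation step could be sketched as a check performed on every read. The record layout, the purpose names, and the `use` function below are illustrative assumptions rather than part of any cited design.

```python
class PurposeError(Exception):
    """Raised when data is requested for a purpose the owner did not allow."""

# Hypothetical record: the data travels with the purposes its owner allowed.
record = {
    "value": "alice@example.com",
    "allowed_purposes": {"order_confirmation", "account_recovery"},
}

def use(record: dict, purpose: str) -> str:
    """Return the value only if the stated purpose was allowed by the owner."""
    if purpose not in record["allowed_purposes"]:
        raise PurposeError(f"purpose not allowed: {purpose}")
    return record["value"]

email = use(record, "order_confirmation")
try:
    use(record, "advertising")
    rejected = False
except PurposeError:
    rejected = True
```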

5.1.6 Provide feedback
Cloud applications should be user friendly and clearly indicate privacy functionality by using icons, providing tutorials, help documents, and visual cues. Applications need to be designed in a way that provides users with feedback, allowing them to make knowledgeable decisions about their privacy.

5.2 Tradeoffs of privacy-aware design
Designers of cloud computing applications need to provide protected and efficient interaction between users and providers.
Nonetheless, some traditional solutions that help software engineers, architects, and others build privacy-aware cloud applications introduce tradeoffs into the design. According to [8], solutions such as encryption deprive cloud service providers of the opportunity to merge identical data, which would reduce storage space. Additionally, encryption hinders the ability to index and process the data.
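
The deduplication tradeoff can be illustrated with a toy example. The cipher below is a deliberately simple randomized stream cipher built from SHA-256; it is an assumption made for illustration only and must not be used to protect real data. The point is that two uploads of the same document are byte-identical in plaintext but differ once each is encrypted with a fresh nonce.

```python
import hashlib
import os

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy randomized stream cipher built from SHA-256 (illustration only)."""
    nonce = os.urandom(16)
    keystream = b""
    counter = 0
    while len(keystream) < len(plaintext):
        block = hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        keystream += block
        counter += 1
    body = bytes(p ^ k for p, k in zip(plaintext, keystream))
    return nonce + body

def stored_copies(blobs: list) -> int:
    """Number of copies a provider keeps if it merges byte-identical blobs."""
    return len({hashlib.sha256(b).hexdigest() for b in blobs})

document = b"quarterly sales report"
plain_copies = stored_copies([document, document])

key = os.urandom(32)
encrypted_copies = stored_copies([toy_encrypt(key, document),
                                  toy_encrypt(key, document)])
```

The provider can merge the two plaintext uploads into one stored copy, whereas the two ciphertexts differ (barring a nonce collision) and must both be kept.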

5.3 Privacy Impact Assessment
According to [14], in the early stages of the design phase, it is recommended that cloud service providers conduct a Privacy Impact Assessment (PIA). The PIA is one tool used to aid an organization in making sure that the choices made in the design stage meet the privacy requirements of a system [2].
There are five reasons why [9] believes organizations should do a PIA:

  • Identifying and managing risks: PIA provides means of addressing project risk as part of the overall project management. Organizations may find it useful to plan a PIA within the context of risk management.
  • Avoiding unnecessary costs: conducting a PIA helps to identify problems in the early stages of a project, when changes are cheapest. The cost of making changes rises at later stages.
  • Inadequate solutions: when solutions for privacy risks are implemented at later stages, they are not as effective as those that are incorporated at the start of the project. Incorporating privacy solutions in the early stages can make the project more resistant and in a better position to recover from any possible failure.
  • Avoiding loss of trust and reputation: PIA provides the means to ensure that systems are not deployed with privacy risks or flaws which could surface in the media. As a result, PIA could help an organization to maintain and increase its reputation.
  • Informing the organization’s communications strategy: conducting a PIA should help the organization to understand the project, and evaluate the perspective of stakeholders. By understanding the concerns of the stakeholders, an organization can understand if further information is needed regarding a project, and can handle any misinformation campaigns created by an opponent.

Similar methodologies exist with a legal status in countries such as Australia, Canada, and the United States of America [14].

5.4 Adopt and Integrate Privacy-Enhancing Technologies (PETs)
Privacy Enhancing Technologies (PETs) are a set of tools or mechanisms that, when integrated or used alongside an application, reduce the risk of breaking privacy principles or legislation [10]. Additionally, PETs diminish the data a cloud service provider needs to store about a user and allow individuals to control their information [10].
According to [10], when it comes to handling personal information, PETs provide good design goals, and offer demonstrable business benefits and a competitive advantage for cloud service providers that adopt them. PETs can be classified in two categories: privacy management and privacy protection tools.

5.4.1 Privacy Management Tools
Privacy management tools allow users to look at the procedures and practices used by cloud service providers that handle their information. Additionally, they tell the users the consequences of sharing their information which improves the user’s understanding of privacy-related issues.

5.4.2 Privacy Protection Tools
Privacy protection tools hide a user’s identity, reduce the information revealed to a cloud service provider, and cover up network connection details. Privacy protection tools are able to authenticate online payments while making it impossible to find a connection to the user originating the transaction [10]. Several software tools fall into this category, such as:

  • Anonymising tools: these minimize the information exposed to a cloud service provider. For example, they can hide the IP address of a user.
  • Information security tools: these prevent unauthorized access to systems, files or communications in a network.

5.4.3 Drawbacks
For cloud service providers that use agile development for their privacy-aware systems, it is very difficult to agree on and develop standards for PETs [10]. Some providers feel PETs introduce unnecessary complexity or that the technology itself could become obsolete in the near future. Also, legacy systems are hard to integrate with PETs because of incompatibilities.

5.4.4 Future of PETs
According to [10], PETs researchers concur that there is a need to design systems in a privacy-friendly way, and for cloud service providers to incorporate PETs into their systems design.
In the future, [10] believes that research into user-centric identity management (U-Idm) in conjunction with PETs may represent a solution to manage and control personal information in a secure way. A U-Idm framework, for the most part, allows users to control their own data on a personal device they fully control. U-Idm frameworks can update information without revealing unnecessary identifying details. An important milestone for U-Idm frameworks is Microsoft Windows CardSpace, an identity platform that integrates U-Idm framework technologies.

6. PRIVACY DESIGNS
In this section, we review different designs for cloud computing applications. This section demonstrates different design models that can aid cloud application designers in dealing with different privacy scenarios such as data leakage, and data access.
Since users will use mechanisms to protect themselves against cloud applications, it is important to know the design of the systems a user may use. Because cloud applications will interact with whatever applications users choose to communicate with, it is worthwhile for designers to understand how these applications are designed.
As a result, this section discusses some privacy designs targeted at users, such as a design that determines the probability that a cloud service provider will enforce its privacy policies, and a privacy model that makes cloud applications accountable for the way a user’s data is handled.
In addition, this section explains a design that protects users’ data against third party auditors, and introduces the concept of sticky privacy policies.

6.1 A Client-Based Privacy Manager
Mowbray and Pearson worked on a client-based privacy manager, whose goal was to reduce the risk of data leakage and the loss of privacy on sensitive data processed in a cloud. The privacy manager is on the client side to help the user protect his privacy when accessing cloud services [12]. Nonetheless, the privacy manager requires help from a server-side component for effective operation.
6.1.1 Features
According to [12], the privacy manager provides five important features:

  • Obfuscation: the privacy manager provides obfuscation and de-obfuscation of data. Using a key which is chosen by the user (and not revealed to cloud service providers), data can be obfuscated when it’s sent to the cloud. As a result, applications in the cloud or attackers will not be able to de-obfuscate the data. Obfuscation techniques are more attractive to users since they have full control over the data, and it hinders the cloud provider’s capability of using the user’s content for advertising purposes.
  • Preference setting: this sets the user’s preferences regarding the handling of their personal data stored within the cloud. Nonetheless, for this feature to be useful, it needs policy enforcing mechanisms within the cloud.
  • Data Access: a module designed to allow users to access personal information and see what is stored about them and its accuracy. It serves as an auditing mechanism to detect privacy violations. The module stores logs on the client machine when the personal information is accessed.
  • Feedback: provides feedback to a user about the usage of their personal information in the cloud. This module monitors if the data is transferred outside of the cloud.
  • Personae: allows users to choose among multiple personas when interacting with a cloud. In some contexts, a user might want to act in an anonymous manner, whereas in other situations he may want to reveal all or part of his identity.
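
To make the obfuscation feature more concrete, the sketch below replaces values with keyed pseudonyms before they are sent to the cloud. It is not the obfuscation method of [12]; HMAC-SHA256 is swapped in as a simple keyed function, and the `Obfuscator` class with its client-side reverse table is a hypothetical construction.

```python
import hashlib
import hmac

class Obfuscator:
    """Replaces values with keyed pseudonyms; the key never leaves the client."""

    def __init__(self, key: bytes):
        self._key = key
        self._reverse = {}  # pseudonym -> original value, kept on the client

    def obfuscate(self, value: str) -> str:
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256)
        pseudonym = digest.hexdigest()[:16]
        self._reverse[pseudonym] = value
        return pseudonym

    def deobfuscate(self, pseudonym: str) -> str:
        return self._reverse[pseudonym]

client = Obfuscator(key=b"user-chosen secret")
sent_to_cloud = [client.obfuscate(name) for name in ["Alice", "Bob", "Alice"]]
restored = [client.deobfuscate(p) for p in sent_to_cloud]
```

Because equal values map to equal pseudonyms, a cloud application could still group or count entries without learning the underlying names, while only the client can map the results back.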


6.1.2 Evaluation of the Client-Based Privacy Manager

Mowbray and Pearson’s client-based privacy manager solution meets some of the minimal privacy requirements of a cloud application [12]:

  • Use limitation is supported by the obfuscation module.
  • Purpose specification is supported by the preference setting module.
  • Openness and transparency are provided via the feedback and data access features.
  • Choice and consent are provided with a user-centric design [12]. The preference setting feature gives users control over their data and the personae feature makes it simpler.
  • Security safeguards can be specified with the assumption that the data access module will be deployed on the service-side.


6.1.3 Drawbacks

The solution proposed by [12] is not appropriate for all cloud applications. The privacy manager needs the full cooperation of the cloud service provider. Cloud service providers that sell user data to advertisers may not be willing to allow users to preserve their privacy. Furthermore, some service providers may be willing to respect a user’s privacy wishes, but may not agree to implement the service-side code necessary for the privacy manager’s features to work.

6.2 A Virtual Private Data Repository
Nowadays, it is still a challenge to make a general mechanism to assure data privacy in clouds. In cloud applications both users and developers access data. Usually, users and developers organize data in different ways. Users manage data using file systems while developers use relational databases [8]. Solutions such as Amazon Simple Storage Service or Bigtable take a similar approach, while providing scalability. Nevertheless, these scalable solutions do not provide strong privacy guarantees or a friendly user interface.
Gu and Cheung recognized an opportunity to create an efficient, easy-to-use interface to access privacy-aware applications. They researched the architecture for a privacy-aware data service. Their goal was to design a privacy-aware general mechanism to access data in cloud environment applications.
To achieve a general privacy-aware data access mechanism, Gu and Cheung designed a “virtual private data repository” (VPDR) [8]. The VPDR provides a file system interface which is familiar to both users and developers. The data written into the VPDR is obfuscated and de-obfuscated with the aid of an access token. The VPDR architecture is based on three components: the virtual private disk (VPD), the virtual network buffer (VNB), and a virtual cloud storage (VCS).
The VPD is a privacy component that can reside in the cloud application or in the user’s computer. The VPD serves as an input/output device where the VPDR is constructed [8].
To obfuscate the data the VPD slices the data at a bit level based on the access token. According to [8], there are several benefits to this approach. First, without the access token, an intruder would have to collect a large number of slices and perform many matching tests to get access to the data. Second, if multiple providers are used, the complexity increases for intruders. Third, a bit level slice mechanism creates an illegible but structural sequence. As a result, cloud applications could still perform certain operations including: compression, merging, and removing duplicates.
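
The exact slicing procedure of [8] is not reproduced here. As a rough sketch of the general idea, the code below uses the access token to decide, bit by bit, which slice each bit goes to, so that the slices can only be interleaved correctly by someone who holds the token. Seeding Python's `random` module from a hash of the token is an assumption made for brevity and is not cryptographically secure; the function names are hypothetical.

```python
import hashlib
import random

def _assignment(token: bytes, n_bits: int, n_slices: int) -> list:
    """Derive, from the token, which slice each bit position belongs to."""
    seed = int.from_bytes(hashlib.sha256(token).digest(), "big")
    rng = random.Random(seed)
    return [rng.randrange(n_slices) for _ in range(n_bits)]

def slice_bits(data: bytes, token: bytes, n_slices: int) -> list:
    """Split the bits of data across n_slices according to the token."""
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    slices = [[] for _ in range(n_slices)]
    for bit, target in zip(bits, _assignment(token, len(bits), n_slices)):
        slices[target].append(bit)
    return slices

def merge_bits(slices: list, token: bytes) -> bytes:
    """Re-interleave the slices in the order dictated by the token."""
    n_bits = sum(len(s) for s in slices)
    cursors = [0] * len(slices)
    bits = []
    for target in _assignment(token, n_bits, len(slices)):
        bits.append(slices[target][cursors[target]])
        cursors[target] += 1
    out = bytearray()
    for i in range(0, n_bits, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

token = b"access-token"
slices = slice_bits(b"patient record", token, n_slices=3)
recovered = merge_bits(slices, token)
```

Each slice on its own is an unreadable bit sequence, and the slices could be stored with different providers, which matches the second benefit listed above.
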
Gu and Cheung also argued that their solution should prevent cloud service providers from accumulating data from users. If users can store their data on multiple cloud service providers, no single provider can accumulate all of it. The VNB component addresses this problem. To accomplish this, it communicates with the providers with controlled uncertainty. Furthermore, its main function is to break the link between the bit slices and their users.
The VCS component resides in the cloud, and it makes sure the user data is sliced and stored in different partitions, so an operator cannot easily combine slices to retrieve the original data.
One drawback to this design is that the data could be deciphered with vast computing resources. Additionally, the VCS component complicates the process of deleting and migrating user data. This is a consequence of the uncertainty between the data and its owner.

6.3 Probabilistic Privacy Manager

Nyre and Jaatun designed a system architecture that will give the users the probability that a specific cloud service provider will respect their requirements and enforce privacy policies. This model could be used to handle uncertainty in privacy enforcement and as a tool to interact with unreliable entities. The architecture is composed of five components: Personal Data Recorder (PDR), Personal Data Monitor (PDM), Trust Assessment Engine (TAE), Trust Monitor (TM), and a Policy Decision Point (PDP).
The web provides many opportunities for information aggregation. An example would be where a user wants to stay unidentified but needs to provide his postal code and an anonymous e-mail address; later a user uses the same anonymous e-mail and additionally provides his age and given name. At this point, a given provider can combine the data and identify an anonymous user [13]. The PDR component solves this problem [13]. The PDR records what information is sent to one or more providers. Also, it gives the user an idea of how a cloud service provider sees this information, which allows them to judge if they are sending too much information or not.
The PDM calculates the probability that an entity will forward the information to another entity. Also, it updates the PDR with collected knowledge.
The TAE module assesses communicating parties by calculating a trust value for determining their trustworthiness.
The TM module detects events that could affect a perceived trustworthiness. This module decides, based on any given circumstance, if the entity has an acceptable level of trust. Additionally, it contains a repository in which it stores feedback from other entities regarding a provider.
The PDP decides if the information should be shared with an entity and under what conditions.
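
The role of the PDR in the aggregation example above can be sketched as a small recorder that accumulates, per provider, every attribute sent so far. The rule that postal code, age, and given name together identify a person is an assumption taken from that example, and the class and method names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rule: these attributes together are assumed to identify a person.
IDENTIFYING_SET = {"postal_code", "age", "given_name"}

class PersonalDataRecorder:
    """Tracks which attributes have been sent to which provider."""

    def __init__(self):
        self._sent = defaultdict(set)

    def record(self, provider: str, attributes: set) -> None:
        self._sent[provider] |= attributes

    def provider_view(self, provider: str) -> set:
        """Everything the provider could have aggregated so far."""
        return set(self._sent[provider])

    def identifiable(self, provider: str) -> bool:
        return IDENTIFYING_SET <= self._sent[provider]

pdr = PersonalDataRecorder()
pdr.record("shop.example", {"postal_code", "email"})
first = pdr.identifiable("shop.example")
pdr.record("shop.example", {"email", "age", "given_name"})
second = pdr.identifiable("shop.example")
```

After the first interaction the user is not yet identifiable under this rule; after the second, the provider's combined view crosses the threshold, which is the situation the PDR is meant to surface.
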
6.3.1 Benefits
According to [13], the probabilistic privacy manager design provides four key benefits to a privacy application: it informs the user about the trustworthiness of an entity; it provides anonymity when necessary; it saves the user’s willingness to interact with an entity; and it calculates the consequences of interacting with an entity.
6.3.2 Drawbacks
The solution designed by [13] has some drawbacks, including:

  • Their TAE does not take into consideration the risk willingness and vulnerabilities an entity can present when calculating its trust value.
  • The PDR is not capable of handling redistribution of data (a receiver forwarding the data to other receivers).
  • The solution does not include a privacy or trust model.

6.4 A Privacy-Preserving Public Auditing Scheme
If a cloud service provider enables public auditability, users can hire a third party auditor (TPA) to audit their data on their behalf. Public auditability means that an external party, other than the service provider, can verify that the remotely stored data is correct and has not been modified.
Wang et al investigated the fact that cloud service providers do not have schemes that support privacy protection against external auditors. As a result, a TPA could introduce new vulnerabilities for users, such as unauthorized leakage of information from their data. Therefore, the design goal of [15] is to allow TPAs to verify the correctness of the data inside a cloud without demanding a copy of the whole data.
Wang et al propose a privacy-preserving public auditing scheme that protects a user’s privacy against TPAs. According to [15], data encryption before storing data into the cloud is used as a complement to the proposed scheme. The reasoning is that encryption does not solve the problem; it only reduces it to managing encryption keys that can still be exposed.
The public auditing scheme consists of four algorithms: KeyGen, SigGen, GenProof, and VerifyProof.
KeyGen is a key generation algorithm that is run by the user to set up the scheme. The SigGen algorithm is used by the user to generate verification metadata that will be used for auditing [15]. GenProof is run by the cloud service provider to generate a proof of data storage correctness. The VerifyProof algorithm is run by the TPA to audit the proof from the cloud service provider [15].
Based on the algorithms, the public auditing scheme can be constructed in two phases: setup and audit.
In the setup phase, the user initializes the public and secret parameters by executing KeyGen. Then, the user pre-processes the data file by using SigGen to generate verification metadata. At this point, the user can store the data file in the cloud, and publish the verification metadata to the TPA for later audit.
In the audit phase, the TPA challenges the cloud service provider to verify that the data file has been preserved appropriately. In this scenario, the cloud application will send a response message by executing GenProof. The TPA can verify the response using the verification metadata via VerifyProof.
To support public auditability without retrieving the data blocks, the privacy-preserving public auditing scheme uses the homomorphic authenticator technique. The homomorphic authenticator generates an unforgeable verification metadata from individual data blocks, which assures an auditor that a linear combination of data blocks is correct by verifying the aggregated authenticator [15].
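
The scheme in [15] is publicly verifiable and relies on cryptographic machinery that is beyond a short example. The sketch below is a much simpler, privately verifiable linear authenticator in which the verifier holds the secret key; this is a deliberate simplification, and it does not reproduce the construction of [15]. It only illustrates how the four algorithms fit together and why an aggregated tag can vouch for a linear combination of blocks. The modulus, the HMAC-based PRF, and all function names are assumptions.

```python
import hashlib
import hmac
import secrets

P = 2**127 - 1  # a prime modulus; blocks are treated as integers below P

def prf(key: bytes, index: int) -> int:
    """Pseudo-random function: HMAC-SHA256 of the block index, reduced mod P."""
    mac = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(mac, "big") % P

def key_gen():
    """Secret parameters: a PRF key and a non-zero scalar alpha."""
    return secrets.token_bytes(32), secrets.randbelow(P - 1) + 1

def sig_gen(sk, blocks):
    """One authenticator per block: sigma_i = prf(i) + alpha * m_i (mod P)."""
    key, alpha = sk
    return [(prf(key, i) + alpha * m) % P for i, m in enumerate(blocks)]

def gen_proof(blocks, tags, challenge):
    """Run by the provider: aggregate the challenged blocks and their tags."""
    mu = sum(nu * blocks[i] for i, nu in challenge) % P
    sigma = sum(nu * tags[i] for i, nu in challenge) % P
    return mu, sigma

def verify_proof(sk, challenge, proof):
    """Check the aggregate without seeing the individual blocks."""
    key, alpha = sk
    mu, sigma = proof
    expected = (alpha * mu + sum(nu * prf(key, i) for i, nu in challenge)) % P
    return sigma == expected

sk = key_gen()
blocks = [11, 22, 33, 44]
tags = sig_gen(sk, blocks)
challenge = [(0, 5), (2, 7), (3, 9)]  # (block index, coefficient) pairs
honest = verify_proof(sk, challenge, gen_proof(blocks, tags, challenge))

blocks[2] = 99  # the provider's copy of one block is corrupted
tampered = verify_proof(sk, challenge, gen_proof(blocks, tags, challenge))
```

The check succeeds for the intact data and fails once a challenged block has changed, even though the verifier only ever receives the two aggregated values.
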
However, if sufficient linear combinations of the same data blocks are collected, the TPA could recover the user’s data by solving a system of linear equations. Consequently, the linear combination is masked with randomness generated by a pseudo random function (PRF). As a result, the TPA would not have all the necessary information to build a correct system of linear equations and learn about the data stored in the cloud [15].
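
A toy calculation shows why unmasked responses leak data. With two blocks and two independent challenges, the auditor can solve for both blocks; once each response is blinded with a value the auditor does not know, the same calculation yields garbage. The modulus, the block values, and the fixed stand-ins for PRF outputs are assumptions, and the sketch leaves out how [15] keeps masked responses verifiable.

```python
P = 2**127 - 1  # prime modulus, so every non-zero determinant is invertible

def solve_two_blocks(challenges, responses):
    """Recover two blocks from two linear combinations via Cramer's rule mod P."""
    (a, b), (c, d) = challenges
    r1, r2 = responses
    det_inv = pow((a * d - b * c) % P, -1, P)
    m1 = (r1 * d - b * r2) * det_inv % P
    m2 = (a * r2 - c * r1) * det_inv % P
    return m1, m2

blocks = (1234, 5678)
challenges = [(3, 5), (2, 7)]

# Unmasked responses: mu = nu_1 * m_1 + nu_2 * m_2 (mod P)
plain = [(a * blocks[0] + b * blocks[1]) % P for a, b in challenges]
leaked = solve_two_blocks(challenges, plain)

# Masked responses: each answer is blinded with a value unknown to the auditor.
masks = [987654321, 123456789]  # stand-ins for PRF outputs
masked = [(mu + r) % P for mu, r in zip(plain, masks)]
guess = solve_two_blocks(challenges, masked)
```

Here `leaked` equals the original blocks, while `guess` does not.
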
According to [15], their solution is secure, and performance benchmarks showed it to be very efficient.

6.5 Sticky Privacy Policies

Creese et al consider that every piece of data residing on equipment that is not managed by users needs to have its privacy addressed. As a result, Creese et al explored methods to design data protection into a cloud application in the early stages of development, avoiding costly future issues and poor protection from design decisions that disagree with data protection needs [4].
To build a data protection mechanism in clouds, Creese et al designed a pattern called Sticky Privacy Policies. The intent of the mechanism is to bind a specific privacy policy to data when it’s stored, processed, and shared.
The sticky policies make sure that multiple parties are aware of the data’s policies and act in accordance to them. For example, a sticky policy can specify that the data can only be used for a particular purpose, by certain people or that the user must be contacted before the data is used.
In sticky policies, the personal information is associated with machine-readable policies that can be composed and extended in flexible ways. For example, Casassa et al showed how sticky policies can be represented in an XML-based format. In an XML format a sticky policy can contain:

  • An owner tag: expressing information about the owner of the data, including an email address. This information can be encrypted using techniques such as Identifier-based Encryption, which is discussed in section 6.6.
  • A validity tag: that contains the expiration date of the policy.
  • Constraint and actions tags: the constraint tag can require a requestor or third party to authenticate before accessing the data. The action tag can notify the owner of the information if any usage of its data seems suspicious.
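
As an illustration of what such a policy might look like, the sketch below builds an XML document with the tags listed above using Python's standard `xml.etree.ElementTree` module. The element names and values are hypothetical and do not reproduce the actual format used by Casassa et al.

```python
import xml.etree.ElementTree as ET

def build_sticky_policy(owner_email, expires, constraint, action):
    """Assemble a policy element; all tag names here are illustrative."""
    policy = ET.Element("sticky_policy")
    owner = ET.SubElement(policy, "owner")
    ET.SubElement(owner, "email").text = owner_email
    ET.SubElement(policy, "validity").text = expires
    ET.SubElement(policy, "constraint").text = constraint
    ET.SubElement(policy, "action").text = action
    return policy

policy = build_sticky_policy(
    owner_email="alice@example.com",
    expires="2012-12-31",
    constraint="requestor must authenticate",
    action="notify owner on suspicious use",
)
xml_text = ET.tostring(policy, encoding="unicode")
```

The resulting string can be stored or transmitted alongside the data it governs, which is what makes the policy "sticky".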

Creese et al identified some design issues their approach needs to address such as:

  • At what level of data granularity should a policy be attached? For example, a personal data element such as a name or an address can have a defined policy, but a whole database could also have a policy attached.
  • The mechanism needs to be compatible with legacy systems.
  • For practicality reasons, it might be better to have a reference to a policy, instead of an actual policy bound to the data.

The solution designed by [4] has several drawbacks: the data could be used by receivers in a way that the data owner or user would not like. Also, the policy could be ignored or detached from the data. Additionally, if the data is bound to the policy, the data can become larger and incompatible with some applications.
Despite these drawbacks, policy specification and verification tools such as The Enterprise Privacy Authorization Language (EPAL), W3C P3P, and others have already adopted the idea of sticky policies.

6.6 An Accountable Management of Identity and Privacy Model
Casassa et al researched the fact that users have little control over the destination of their data once it is released to a third party. Furthermore, organizations are not accountable for the information they share with other organizations.
For example, in an e-commerce scenario, users deal with transactions that span across multiple e-commerce websites. It starts when a user provides their identity to an e-commerce website to access their services. When the user interacts with a website, it may be that they are also interacting with other organizations. There is a chance that the website discloses personal data to other organizations in order to fulfill a transaction. To solve this problem, Casassa et al suggest a mechanism to associate disclosure policies for personal data but most importantly, increase the accountability of the implicated organizations.
The model proposed by [1] has several key aspects. First, [1] adopts the sticky policy paradigm, to allow users to agree to an applicable privacy policy along with opt-in and opt-out mechanisms. Also, the proposed model uses a Tracing Authority component which tracks the disclosure of data by an organization. Other key aspects include: obfuscation of personal information, disclosure of personal information only if the sticky policies’ constraints are followed, and enforced tracing and auditing of disclosures of personal data, to increase the accountability of the organization receiving the data.
At a high level, the privacy model proposed by [1] to enforce accountability on organizations can be explained in seven steps based on the e-commerce scenario:

  • Users use graphical tools to define their sticky policies, obfuscate their data, and associate the obfuscated data to their customized policies.
  • The user can start interacting with an e-commerce website by providing digital packages with the obfuscated data along with their sticky policies.
  • The requestor (the e-commerce website) interacts with the Tracing Authority component to demonstrate that the involved terms and conditions are understood.
  • The Tracing Authority receives a request and checks the integrity and trustworthiness of the requestor’s credentials.
  • In the [1] model, nothing prevents the user from being involved in the disclosure process. As a result, the user can approve or disapprove the disclosure of their information.
  • The actual disclosure of obfuscated data to a requestor (the e-commerce website) only happens if the requestor can demonstrate to the Tracing Authority that it can obey the sticky policies set by the user.
  • Disclosure of personal information is logged and audited by the Tracing Authority. At this step, the accountability of the requestor is logged, and evidence about their knowledge of the users’ personal information is created. If the information is indiscriminately distributed to other organizations, the Tracing Authority has enough evidence for forensic analysis [1].
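
The seven steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the implementation described in [1]: the class names, the `allowed_purposes` field and the `request_disclosure` method are all hypothetical, the "obfuscated data" is an opaque byte string, and no real cryptography (IBE) or platform checking (TCPA) is performed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StickyPolicy:
    """Hypothetical policy a user attaches to obfuscated personal data."""
    allowed_purposes: frozenset
    requires_user_approval: bool = True


@dataclass
class DataPackage:
    """Obfuscated data bundled with its sticky policy (first two steps)."""
    owner: str
    obfuscated_data: bytes
    policy: StickyPolicy


@dataclass
class TracingAuthority:
    """Toy stand-in for the Tracing Authority component."""
    trusted_requestors: set
    audit_log: list = field(default_factory=list)

    def request_disclosure(self, requestor, purpose, package, user_approves):
        # Check the integrity and trustworthiness of the requestor.
        if requestor not in self.trusted_requestors:
            decision = "denied: untrusted requestor"
        # The requestor must show that it can obey the sticky policy.
        elif purpose not in package.policy.allowed_purposes:
            decision = "denied: purpose not allowed by policy"
        # The user may approve or disapprove the disclosure.
        elif package.policy.requires_user_approval and not user_approves:
            decision = "denied: user did not approve"
        else:
            decision = "disclosed"
        # Every request is logged so that evidence exists for later audits.
        self.audit_log.append((requestor, package.owner, purpose, decision))
        return package.obfuscated_data if decision == "disclosed" else None
```

In this sketch, the audit log grows on every request, whether or not the data is disclosed, which mirrors the idea that the Tracing Authority accumulates evidence for forensic analysis.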

The accountable management of identity and privacy model proposed by [1] uses two technologies to accomplish the model explained above: Identifier-based Encryption (IBE) and Trusted Computing Platform Alliance (TCPA).
IBE is an emerging cryptographic scheme in which any type of string (containing a name, a role, terms and conditions, and many others) can be used as an encryption key. TCPA technology is able to check that the receiver’s operating system is a trusted platform. It can also verify that the software installed on the computer conforms to the disclosure policies and can implement the defined privacy management mechanisms.

7. OTHER ISSUES
In sections five and six, we discussed guidelines and privacy designs used for privacy-aware cloud applications. Other related issues that have an impact on privacy-aware cloud applications include: testing, accountability, terms of service and privacy policies. Those issues are discussed in this section.

7.1 Testing Privacy-Aware Clouds
In traditional software applications, users are only able to change configurations or options. In cloud applications, the user’s involvement is more significant, and more instantaneous. The software behavior changes continuously with the user behavior. As a result, users are closely involved with the design of cloud applications, by either changing the program state affecting the system behavior, or changing the logic of the application.
According to [8], in testing, user’s participation is even more active and direct. Users might not realize that they have become a powerful and indispensable part of the testing and quality assurance team. For privacy-aware applications in clouds, it is critical to provide good privacy mechanisms and avoid revealing internal information about the application to the parties involved in the testing.

7.1.1 A new testing paradigm
In traditional software companies, the developer-to-tester ratio is around one to one. However, leading internet application providers have more developers than testers [8]. The disparity shows a new paradigm in quality assurance methodologies. The new paradigm is a result of the capacity of both internet and cloud applications to rapidly release fixes for bugs, and new versions of software. As a result, it is a challenge to ensure that an application still holds high standards of software quality.
Gu and Cheung suggest that, to solve this problem, cloud service providers have taken an incremental-release approach. Usually, cloud service providers release a new set of changes to a small group of users. If the testing is successful, the changes are then applied to the production application. Incremental testing is unlikely to affect users if it is carefully managed. Furthermore, it enhances communication among designers, developers, and software engineers while incorporating the users into the test environment.
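
One common way to pick the "small group of users" in an incremental release is to hash each user identifier into a fixed number of buckets. The following Python sketch illustrates that idea; it is an assumption about how such a rollout could be implemented, not a description of any particular provider's system, and the function name and the 100-bucket scheme are hypothetical.

```python
import hashlib


def in_rollout_group(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user should receive the new release.

    The user ID is hashed into one of 100 buckets, so the same user
    always gets the same answer for a given rollout percentage.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the user ID, raising `rollout_percent` from 5 to 50 keeps the original test users in the group and only adds new ones. Hashing the identifier also means the selection does not need to inspect any other personal information about the user.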

7.2 Accountability
Cloud service providers should value accountability and include this principle in the design stages of their privacy-aware applications. Accountability can be defined as placing a legal responsibility on an organization that stores PII, and ensuring that if the organization supplies PII to a third party, that party abides by the previously agreed privacy policies.
According to [3], accountability may be a good principle to provide privacy towards cloud computing applications. Charlesworth and Pearson identify five elements to provide accountability in privacy-aware cloud applications:

  • Transparency: the level of openness about a cloud service provider’s handling of PII that allows meaningful accountability. Users should be informed about how their information is used within the cloud.
  • Assurance: through privacy policies, cloud service providers can provide assurances to respect contractual measures and audits.
  • User Trust: accountability promotes user trust. When users are not clear why their information is requested or how it will be processed, this lack of information leads to suspicion and distrust.
  • Responsibility: data protection requires a big share of responsibility from the cloud service provider. Establishing responsible and accountable privacy standards allows providers to assess risks in terms of financial losses and privacy breaches.
  • Policy compliance: accountability ensures that cloud service providers fulfill the laws.

Finally, it can be said that incorporating accountability into privacy-aware cloud applications helps to ensure that the laws that apply to cloud computing are followed.

7.3 Terms of Service and Privacy Policy
According to [6], terms of service, from a privacy perspective, may be the most important feature of cloud computing for a user who is not subject to a legal obligation.
The terms of service is a key document that attempts to define the relationship between a user or customer and the provider of a service, the service itself, and the parameters used to define the performance of the cloud service provider [4].
Cloud service providers offer their services to users without individual contracts, but subject to their terms of service. If the terms of service give providers control over a user’s personal information, the user must respect those terms. As a result, cloud service providers may be able to copy, use, change, publish, distribute, display and share this information with their affiliates [6].
In addition, cloud service providers reserve the right to change policies without any limits at any time. Consequently, if a user agreed to a cloud provider’s terms of service, and the terms changed without the user’s awareness, a change could create legal liabilities to users.
The terms of service allows a cloud service provider to terminate the user at any time. Therefore, if a user does not have a backup of their information, it can be lost. For organizations or government agencies this could be disastrous [6].
Creese et al. believe that by understanding and analyzing the terms of service, software engineers can formulate engineering requirements which can help in the process of designing privacy-aware solutions for clouds.

8. CONCLUSIONS
After reviewing design practices for cloud computing applications, the lessons learned, and future opportunities to design privacy-aware applications can be summarized based on the existing research.
Taking privacy into account when designing cloud computing applications is critical if personal information is going to be collected, processed or shared. Privacy should be a fundamental design goal, and it should cover both users and service providers.
Furthermore, privacy should be built into every phase of the development process; it cannot be added at a later stage. Privacy-aware clouds need to have privacy testing methodologies to ensure that internal activities in the cloud and product features are not leaked to third parties.
In the context of the conducted research work, future work would be well advised to:

  • Explore software development methodologies such as agile development. In cloud computing, requirements for an application change based on the user’s needs. Having a full design specification is not always possible. Applications need to be tested more frequently. As a result, designing privacy-aware cloud applications on agile environments will become relevant [14].
  • Incorporate privacy templates. It may be useful for developers and software engineers not only to have guidelines such as the ones described in section 5, but to have privacy templates for the different privacy scenarios that can take place [14].
  • Good privacy designs and accountability go together. It is a practical mechanism to reduce a user’s privacy risks and enhance a cloud service provider’s credibility.
  • Create a language for privacy design. Efforts in the future should focus on a privacy language whose goal is to equip executives and technology professionals with a shared vocabulary that allows them to create and discuss privacy requirements in an understandable manner for all parties [10].

Different privacy guidelines, requirements and design models were suggested that may be used by software engineers, architects or developers, in order to reduce the risks and threats on cloud applications. Nonetheless, there are still opportunities for improvement. Future work might address one or more of the following:

  • How can consent and revocation of consent be provided within privacy-aware cloud applications? A United Kingdom project called EnCoRe (Ensuring Consent and Revocation) is trying to answer this question by examining solutions in the areas of consent and revocation of personal information [12].
  • Cloud applications must consider scalability, exploit parallelism, and at the same time protect a user’s privacy. To address this challenge, parallelization models such as MapReduce have become popular. Considering MapReduce does not provide a mechanism to protect a user’s privacy, how can we enhance existing parallelization frameworks to provide privacy protection?
  • Testing a software component to provide privacy guarantees involves designing a set of test scenarios [8]. Can we exhaustively test a piece of software against a set of privacy policies? How can we define an exhaustiveness criterion? How can we evaluate the quality of the testing performed?
  • When software engineers incorporate terms of service into their applications, how can they determine design pattern properties and the details required by the terms of services to satisfy a user?
  • How can cloud service providers recruit individuals with accredited skills in privacy management and designing? According to [10], in the United Kingdom there is no body that is recognized as providing such accreditation. Therefore, there is a clear need to establish a professional body for privacy professionals not only in the United Kingdom but across the globe.

Designing privacy into cloud applications is a win-win situation, in which users and cloud service providers are the beneficiaries. Designing privacy-aware cloud applications provides more efficient management of personal information, which reduces processing costs, provides more precise data, and creates a competitive advantage gained through trust and responsible management of personal information. When cloud service providers adopt privacy as their design priority, privacy risks and threats will be a thing of the past.

9. REFERENCES
[1] Casassa, Marco., Pearson, Siani., and Bramhall, Pete. “Towards Accountable Management of Identity and Privacy: Sticky Policies and Enforceable Tracing Services”. In: DEXA 2003, pp. 377-382. IEEE Computer Society. 2003
[2] Cavoukian, Ann. “Privacy By Design”. January 2009. Available via http://www.privacybydesign.ca/pbdbook/PrivacybyDesignBook-ch17.pdf
[3] Charlesworth, Andrew., and Pearson, Siani. “Accountability as a Way Forward for Privacy Protection in the Cloud”. HP Labs. December 2009.
[4] Creese, Sadie., Hopkins, Paul., Pearson, Siani., and Shen, Yun. “Data Protection-Aware Design for Cloud Computing”. HP Labs. August 2009.
[5] EXOCOM Group, Inc., “Privacy Technology Review”, August 2001. Available via http://www.hc-sc.gc.ca/hcs-sss/pubs/ehealth-esante/2001-priv-tech/index-eng.php
[6] Gellman, Robert. “Privacy in the Clouds: Risks to Privacy and Confidentiality from Cloud Computing”. World Privacy Forum. February 2009.
[7] Griffin, Lavonne. “Technology Definitions”. May 2009. Available via http://www.brownfield.org/auditor/index.cfm?a=114449&c=42101
[8] Gu, Lin., and Cheung, Shing-Chi. “Constructing and Testing Privacy-Aware Services in a Cloud Computing Environment, Challenges and Opportunities”. October 2009.
[9] Information Commissioner’s Office. “Privacy Impact Assessment Handbook”. December 2007. Available via http://www.ico.gov.uk/upload/documents/pia_handbook_html_v2/html/1-Chap1-2.html
[10] Information Commissioners Office, “Privacy by Design: An overview of privacy enhancing technologies”, November 2008. Available via http://www.ico.gov.uk/upload/documents/pdb_report_html/privacy_by_design_report_v2.pdf
[11] Microsoft Corporation. “Privacy Guidelines for Developing Software Products and Services”. 26th April 2007. Available via http://www.microsoft.com/Downloads/details.aspx?FamilyID=c48cf80f-6e87-48f5-83ec-a18d1ad2fc1f&displaylang=en
[12] Mowbray, Miranda., and Pearson, Siani. “A Client-Based Privacy Manager for Cloud Computing”. HP Labs. June 2009.
[13] Nyre, Asmund., and Jaatun, Martin. “Privacy in a Semantic Cloud: What’s Trust Got to Do with It”. Proceedings of the 1st International Conference on Cloud Computing. 2009.
[14] Pearson, Siani. “Taking Account of Privacy when Designing Cloud Computing Services”. HP Labs. ICSE’09 Workshop. May 2009.
[15] Wang, Cong., Wang, Qian., Ren, Kui., and Lou, Wenjing. “Privacy-Preserving Public Auditing for Data Storage Security in Cloud Computing”. Illinois Institute of Technology. IEEE Infocom. November 2009.

Jan 13

Hadoop Distributed File System contains 114 java files. After analyzing these files, I concluded the HDFS concrete architecture can be represented in three main subsystems: Server, Protocol and Tools. Additionally, it contains other components such as: HDFS Policy Provider, Checksum Distributed File System, Distributed File System Client, Distributed File System Util, HFTP File System, and HSFTP File System. Figure 1 illustrates the concrete architecture of the Hadoop Distributed File System.

Figure 1: Hadoop Distributed File System Concrete Architecture

3.2.1 Server
Hadoop Core master-slave architecture is clearly represented in the server subsystem. It shows the interaction between the datanode and namenode components. Additionally, it shows the protocol both components use to communicate with one another. The server subsystem contains five components which are explained in sections 3.2.1.1 – 3.2.1.5.

3.2.1.1 Protocol
The protocol component contains the datanode protocol: a protocol that a DFS datanode uses to communicate with the namenode. The datanode protocol sends a heartbeat to tell the namenode that a datanode is still alive, with some status information appended. It also determines actions such as: transfer blocks to another datanode, invalidate blocks, shut down a node, request a block recovery, and others.
The protocol component also handles the instructions sent to a datanode regarding some blocks under its control. It tells the datanode either to invalidate a set of blocks, or to copy a specific set of blocks into another datanode.

3.2.1.2 Balancer
Distribution of blocks across datanodes can become unbalanced [6]. An unbalanced cluster relies heavily on highly utilized datanodes. The balancer tool re-distributes blocks by balancing disk space usage on an HDFS cluster. It moves blocks from over-used datanodes to under-used datanodes, while placing block replicas on different racks. The balancer runs until the cluster is balanced, it cannot move any more blocks, or it loses contact with the namenode [6].
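
The core idea of balancing can be sketched with a simple greedy loop. The Python code below is a simplified illustration and not the algorithm implemented by the HDFS balancer: it counts blocks instead of measuring disk utilization, ignores racks and replicas, and the function name and `threshold` parameter are hypothetical.

```python
def rebalance(used_blocks, threshold=1):
    """Greedily move blocks from the fullest node to the emptiest node.

    used_blocks maps a datanode name to the number of blocks it stores.
    The loop stops once the gap between the fullest and emptiest node
    is at most `threshold` blocks.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1 to guarantee termination")
    usage = dict(used_blocks)  # work on a copy
    moves = []
    while True:
        fullest = max(usage, key=usage.get)
        emptiest = min(usage, key=usage.get)
        if usage[fullest] - usage[emptiest] <= threshold:
            break  # the cluster is considered balanced
        usage[fullest] -= 1
        usage[emptiest] += 1
        moves.append((fullest, emptiest))
    return usage, moves
```

For example, `rebalance({"dn1": 10, "dn2": 2, "dn3": 3})` returns the new block counts together with the list of (source, target) moves that were performed.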

3.2.1.3 Common
The Common component provides internal constants for HDFS. Also, it throws exceptions when a file system is inconsistent and is not recoverable. Additionally, it contains common classes for storage information. It stores the type of node (namenode or datanode), storage layout version, and the file system creation time. Storage can reside in multiple directories, yet each directory contains the same version. Furthermore, the common component provides a common interface to upgrade namenode or datanode objects.

3.2.1.4 Datanode
The Datanode component stores a set of blocks for a DFS deployment. Moreover, it maintains a map from a block to its metadata. A datanode communicates regularly with a single namenode. It can also communicate with other datanodes. One deployment can have one or many datanodes.
The datanode allows a client to read blocks, or write new block data. When instructions are received from the namenode it may delete or copy blocks from other datanodes. Blocks are stored on a local disk. When a server starts, the datanode reports the table of contents to the namenode. However, it is also capable of maintaining various statistics of the blocks. Additionally, the datanodes maintain an open server socket so that client code or other datanodes can read or write data.

3.2.1.5 Namenode
The namenode subsystem manages the file system namespace and controls access by external clients. Unless there is a second backup namenode, usually there is a single namenode running in any DFS deployment. It contains key components such as:
• INode Directory: keeps an in-memory representation of the block hierarchy.
• Log Manager: reads and writes log data from storage.
• File System Directory: handles writing values to disk and loading them from disk, and logs changes as they happen.
• Secondary Namenode: a helper to the primary namenode. It is responsible for supporting periodic checkpoints of the HDFS metadata. In a HDFS cluster, only one secondary namenode is allowed.
• File System Image: stores all information about the file system namespace.

3.2.2 Protocol
The protocol subsystem allows a user to communicate with a namenode. It also allows users to find lost blocks and to check for quota and file exceptions. The protocol subsystem contains the following components:
• Exceptions: manages exceptions such as: when a user wants to create a file that is being created but is not closed yet, disk space and namespace quota is exceeded, datanodes that are not previously registered try to access namenodes, and others.
• Client Protocol: provides a protocol for block recovery. Additionally, it allows users to manipulate the directory namespace, and open and close file streams.
• Blocker Reporter: it reports where to find a collection of blocks and its file length.
• Data Transfer Protocol: streaming protocol used by the client to transfer data to and from the datanode.

3.2.3 Tools
The tools subsystem provides administrative access to the HDFS, and provides a rudimentary tool for checking DFS volumes for errors and sub-optimal conditions. It contains two components:
• DFS Admin: reports how the file system is doing. It allows administrators to put a cluster in safe mode, generate a list of datanodes, and decommission datanodes [1].
• DFS Volume Check: scans all files and directories starting from an indicated root path. It detects and handles abnormal conditions. Additionally, it is able to collect DFS statistics, and can print statistics on block locations and replication factors of each file.

3.2.4 Additional HDFS Components
Other components found in the HDFS concrete architecture include:
• HDFS Policy Provider: provides the HDFS definitions and protocols for the security in effect.
• Checksum Distributed File System: creates a checksum file for each raw file. It generates and verifies checksums at the client side.
• Distributed File System Client: component that connects to a Hadoop file system and performs basic file tasks such as: rename, delete, set permissions, set file or directory owner, set or reset quotas, and others. It uses client protocol to communicate with a Namenode daemon, and connects directly to datanodes to read and write block data.
• Distributed File System Util: a utility to verify whether a pathname is valid. It prohibits relative paths, or names that contain a “:” or “/”.
• HFTP File System: provides the implementation of a protocol for accessing file systems over Hyper Text Transfer Protocol (HTTP).
• HSFTP File System: provides the implementation of a protocol for accessing file systems over Hyper Text Transfer Protocol Secure (HTTPS).
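
To make the pathname check performed by the Distributed File System Util component more concrete, the rule described above can be sketched in Python. This is a re-statement of the description in this section, not the Hadoop source code: the function name is hypothetical, and treating "." and ".." as relative elements is an assumption on my part.

```python
def is_valid_name(path: str) -> bool:
    """Return True if `path` looks like a valid absolute DFS pathname."""
    # Relative paths are prohibited: the path must start at the root.
    if not path.startswith("/"):
        return False
    # Splitting on "/" means no single name can itself contain a "/".
    for name in path.split("/"):
        if ":" in name:
            return False  # names must not contain ":"
        if name in (".", ".."):
            return False  # assumption: relative elements are rejected too
    return True
```

With this sketch, `/user/data/file.txt` is accepted, while `user/data`, `/user/da:ta` and `/user/../etc` are rejected.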

Aug 22

I wrote this paper with Yongning Zhang for a course I am taking called: “Advanced Topics in Human-Computer Interaction: Experimental Methods in HCI”. I would like to thank Dr. Edward Lank for his help.

Do language-checking tools improve the document quality of non-native speakers?

ABSTRACT
Text editors help users to create and share documents. We focus on one of their most popular features, language-checking tools, and on non-native speakers of the English language, and ask whether language-checking tools improve the quality of their documents. We present a quantitative study which measures the effects of language-checking tools on documents. We also conducted a qualitative study to find out how non-native speakers use language-checking tools. Results show that language-checking tools do not significantly improve the quality of non-native speakers’ documents. However, users still trust them.

INTRODUCTION
In the past decades, language-checking tools have been integrated into different text editors and have been widely adopted by computer users. Recently, with the help of Web 2.0 technology, several web-based editors are able to provide spelling and grammar checking in real time. Users do not need to install any language-checking tools on their computers.

Most language-checking tools are designed to scan the text to find spelling and grammar mistakes, such as: fragments, run-on sentences, subject-verb disagreement, passive voice, double words, and split infinitives. Such mistakes are flagged with colored wavy lines, highlighted backgrounds or underlines, in order to attract the user’s attention. Additionally, suggestions for every mistake are provided to users. The entire process tries to help users improve the quality of their documents and relieve them of hand-checking work.

Unfortunately, two common phenomena severely affect the performance of current language-checking tools, namely, false negatives and false positives. False negatives are true errors that the language-checking tools fail to detect [2]. False positives are problems that the language-checking tools detect that are not errors [2].

Both false negatives and false positives are non-trivial problems for language-checking tools. For example, Kies [3] discovered that language-checking tools can identify only six of the twenty most common grammar mistakes [1]. False negatives might therefore cause users to overlook mistakes that could easily be identified by hand-checking. As a result, the document quality is low. False positives can also be a problem, not only because perfectly acceptable words or passages are flagged erroneously, but also because the language-checking tools provide imperfect suggestions. Erroneous suggestions can either distort the true meaning of a sentence or create new mistakes [2].

In order to fully utilize language-checking tools, a user needs enough verbal skill to address the false negatives and false positives generated by such tools. But the question is: what happens if a user has not mastered the language well, as is the case for many non-native speakers?

In our study, we replicate the study of Galletta et al. to examine whether and how language-checking tools can help to improve the document quality of non-native speakers. It consists of a quantitative study that evaluates the performance of language-checking tools used by non-native speakers. We also conducted a qualitative study that reveals the credibility of such tools. Our results show that although language-checking tools are trusted by most non-native speakers, they do not significantly improve document quality.

RELATED WORK
At first, it is tempting to assume that language-checking tools improve document quality. Language-checking tools automatically correct spelling and grammar errors.

However, Galletta et al. showed that the presence of language-checking tools in a text-processing system decreases the quality of a document. Galletta et al. asked their participants to edit a business letter using Microsoft Word 2003 [2]. The language-checking tools were turned on for half the subjects, and turned off for the other half [2]. Task performance was based on three types of errors: correctly identified errors, false negatives, and false positives.

Nonetheless, there are limitations with this study. First, Galletta et al.’s experiment has too much internal validity. Second, the test scores cannot accurately reflect a participant’s verbal ability. Third, a participant’s verbal ability greatly affects the result. Lastly, Galletta et al. do not take non-native speakers into consideration.

Very little work has been done on tools that can help non-native speakers to detect errors that language-checking tools cannot detect. However, Park et al. researched collocation errors, which are errors in the commonly used expressions, idioms and word pairings of a language. To assist non-native speakers with these parts of the English language, they developed AwkChecker [6]. It is the first end-user tool that automatically flags collocation errors and suggests replacement expressions.

AwkChecker suggests corrections for four common types of collocation errors: insertion, deletion, substitution, and transposition errors. Insertion errors occur when an extra word is inserted into a phrase. Deletion errors occur when a word is deleted from a phrase. Substitution errors happen when a non-preferred word is used in place of a more commonly used word. Lastly, transposition errors occur when two words are swapped [6].
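
To illustrate the four error types, the following Python sketch classifies the difference between a phrase as written and a preferred phrase at the word level. It is only an illustration of the definitions above and is not AwkChecker’s algorithm; the function name and the example phrases are hypothetical.

```python
def classify_collocation_error(observed: str, preferred: str) -> str:
    """Classify how `observed` differs from `preferred`, word by word."""
    obs, pref = observed.split(), preferred.split()
    if obs == pref:
        return "none"
    if len(obs) == len(pref) + 1:
        # Removing one word from the observed phrase gives the preferred one.
        if any(obs[:i] + obs[i + 1:] == pref for i in range(len(obs))):
            return "insertion"
    elif len(obs) == len(pref) - 1:
        # Removing one word from the preferred phrase gives the observed one.
        if any(pref[:i] + pref[i + 1:] == obs for i in range(len(pref))):
            return "deletion"
    elif len(obs) == len(pref):
        diff = [i for i, (a, b) in enumerate(zip(obs, pref)) if a != b]
        if len(diff) == 1:
            return "substitution"
        if len(diff) == 2:
            i, j = diff
            if obs[i] == pref[j] and obs[j] == pref[i]:
                return "transposition"
    return "other"
```

For instance, comparing "strong rain" with the preferred "heavy rain" yields a substitution, while comparing "listen music" with "listen to music" yields a deletion.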

To the best of our knowledge, no study has considered if language-checking tools improve the document quality of non-native speakers.

STUDY
To understand whether language-checking tools improve document quality for non-native speakers, we conducted a replication study based on the experiment of Galletta et al.
Participants
We recruited eight participants using snowball sampling and e-mail. All participants use English as their second language.
Experiment Setup
We provided participants with two essays (denoted as essay A and essay B). Both essays have the same style and were written by the same author. Each participant was asked to edit one essay with the language-checking tools and the other without the tools. The order of the essays was the same for all participants: all participants first edited essay A. However, we did switch the usage of language-checking tools. More specifically, all odd-numbered participants (P1, P3, P5 and P7) were asked to edit essay A with language-checking tools and essay B without tools. All even-numbered participants (P2, P4, P6 and P8) were asked to edit essay A without tools and essay B with tools. Consequently, we minimized the confounding variables between both essays.
Research Hypothesis
Our research hypothesis was that turning the language-checking tools on would not improve the quality of the essays.
Results
We evaluated sixteen essays that were modified by non-native speakers. We measured each participant’s time and evaluated their essays. The scores and times of each participant are listed in Table 1.

             Language tools (scores)       Time (minutes)
Participant  Without tools  With tools     Without tools  With tools
P1           5.5            6.5            15             7
P2           5.5            7.5            5              7
P3           7.5            7              12             8
P4           5.5            9.5            9              12
P5           8.5            9.5            25             20
P6           6.5            6.5            22             14
P7           6.5            6.5            24             15
P8           8.5            8.5            16             14

Table 1: Participants’ scores and times
To evaluate the performance of our participants, our grading criteria were based on:

  • The readability of the essay
  • Essay structure and content
  • Spelling and grammar errors that have been correctly detected and modified.
  • Spelling and grammar errors that have been correctly detected but not properly modified.
  • False positives
  • False negatives
  • Four common types of collocation errors.

Each essay was graded with a numeric score from 1.0 to 10.0, with 10.0 being the highest possible score.
Before analyzing our data, we checked our sample for statistical outliers. The bounds used are listed in Table 2.

               Language tools (scores)       Time (minutes)
               Without tools  With tools     Without tools  With tools
Highest Value  10.59          11.61          37.75          25.86
Lowest Value   2.90           3.76           -5.75          -1.61

Table 2: Calculating the presence of statistical outliers
Based on these results, we can confirm that there are no outliers in our sample.
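
The bounds in Table 2 are consistent with the mean plus or minus three sample standard deviations of each column in Table 1. Assuming that this is the rule that was applied, the check can be reproduced with the Python sketch below; the variable and function names are only illustrative.

```python
from statistics import mean, stdev

# Data copied from Table 1 (participants P1 to P8).
scores_without = [5.5, 5.5, 7.5, 5.5, 8.5, 6.5, 6.5, 8.5]
scores_with = [6.5, 7.5, 7.0, 9.5, 9.5, 6.5, 6.5, 8.5]
time_without = [15, 5, 12, 9, 25, 22, 24, 16]
time_with = [7, 7, 8, 12, 20, 14, 15, 14]


def three_sigma_bounds(values):
    """Return (lowest, highest) as mean -/+ 3 sample standard deviations."""
    m, s = mean(values), stdev(values)
    return m - 3 * s, m + 3 * s


def find_outliers(values):
    """Return the values that fall outside the three-sigma bounds."""
    low, high = three_sigma_bounds(values)
    return [v for v in values if v < low or v > high]


for name, column in [("scores without tools", scores_without),
                     ("scores with tools", scores_with),
                     ("time without tools", time_without),
                     ("time with tools", time_with)]:
    low, high = three_sigma_bounds(column)
    print(f"{name}: lowest={low:.2f} highest={high:.2f} "
          f"outliers={find_outliers(column)}")
```

Small differences in the last decimal place compared with Table 2 may appear because of rounding.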
To analyze our data based on the scores, we decided to use a Paired Difference t-Test and performed the following steps:

Step1: State the Hypothesis
We choose a two-tailed test using these hypotheses:
H0: Language-checking tools do not improve document quality.
H1: Language-checking tools do improve document quality.

Step2: Specify the Decision Rule
Our test statistic will follow a t distribution with d.f. = n – 1 = 8 – 1 = 7. With α = 0.05, the two-tail critical value is t.025 = ±2.365.
The decision rule is:
Reject H0 if tcalc < –2.365 or if tcalc > +2.365
or
Reject H0 if p-value < α
Otherwise, do not reject H0

Step3: Calculate the Test Statistic
We used Microsoft Excel 2007 to calculate the test statistic. The results are listed in Table 3.

                     Without tools  With tools
Mean                 6.76           7.68
Variance             1.64           1.70
Df                   7
Test Statistic       -1.79
T Critical two-tail  ±2.36
P(T<=t) two-tail     0.1151

Table 3: Calculating the test statistic
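
For readers who want to reproduce the test statistic without Excel, the paired difference t statistic can be computed from the scores in Table 1 with a few lines of Python. This is a sketch using only the standard library; the names are illustrative, and the critical value is taken from Step 2 rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Scores copied from Table 1 (participants P1 to P8).
scores_without = [5.5, 5.5, 7.5, 5.5, 8.5, 6.5, 6.5, 8.5]
scores_with = [6.5, 7.5, 7.0, 9.5, 9.5, 6.5, 6.5, 8.5]


def paired_t_statistic(a, b):
    """t = mean(d) / (sd(d) / sqrt(n)), where d is the per-participant difference a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


T_CRITICAL = 2.365  # two-tail critical value for d.f. = 7 and alpha = 0.05 (Step 2)

t_calc = paired_t_statistic(scores_without, scores_with)
reject_h0 = abs(t_calc) > T_CRITICAL
print(f"t = {t_calc:.2f}, reject H0: {reject_h0}")
```

The differences are taken as "without tools" minus "with tools", which is why the statistic is negative when the scores with tools are higher on average.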

Step4: Make the decision
Since tcalc = -1.79 lies between the critical values (±2.36), based on a 5% level of significance, we do not reject H0. In addition, since the p-value = 0.1151 is greater than α = 0.05, we do not reject H0. We say that the results are not statistically significant at the 5% level. On the other hand, we found that there is a positive correlation between the scores and the time spent editing the essays, as shown in Figures 1 and 2.
Figure 1: Correlation between time and scores not using language-checking tools.

Figure 2: Correlation between time and scores using language-checking tools.
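
The strength of the relationship between time and scores can also be expressed as a Pearson correlation coefficient computed from Table 1. The Python sketch below shows one way to do this with the standard library only; the coefficient is not reported in our original analysis, and the function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Data copied from Table 1 (participants P1 to P8).
scores_without = [5.5, 5.5, 7.5, 5.5, 8.5, 6.5, 6.5, 8.5]
scores_with = [6.5, 7.5, 7.0, 9.5, 9.5, 6.5, 6.5, 8.5]
time_without = [15, 5, 12, 9, 25, 22, 24, 16]
time_with = [7, 7, 8, 12, 20, 14, 15, 14]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return covariance / spread


r_without = pearson_r(time_without, scores_without)
r_with = pearson_r(time_with, scores_with)
print(f"r without tools = {r_without:.2f}, r with tools = {r_with:.2f}")
```

A coefficient above zero indicates that participants who spent more time editing tended to obtain higher scores, in line with the figures.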

Discussion
Based on our statistical analysis in the previous section, we found that language-checking tools do not significantly improve the document quality of non-native speakers. This result is slightly different from the study of Galletta et al. [2]. They found that native speakers with strong language abilities make more mistakes when they use language-checking tools. When language-checking tools are used, users tend to ignore unflagged errors, resulting in more false negatives. However, in our study, all participants were non-native speakers. Due to their limited language ability, participants produced a relatively large number of false negatives even without using the language-checking tools. Therefore, the number of false negatives does not necessarily increase when language tools are used. In that case, we can conclude that for non-native speakers, language-checking tools neither improve nor harm the document quality.

As shown in Figures 1 and 2, participants finished their tasks faster with the language-checking tools. Both figures indicate that language-checking tools do improve the task time. This may also suggest a “lazy” behavior of the users: they tend to skip the parts that do not have flagged errors.

Considering both quality and learning issues, there have been several suggestions made on how to improve language-checking tools targeted to non-native speakers.

Knutsson et al provided a series of guidelines for the design of language tools for non-native speakers [5,6]. They claimed that a good language tool should enable users to understand its capacity and limitation, and help users to learn. From the previous section, we know that language checking tools cannot correct false negatives. If users only rely on language-checking tools to check and edit their essays, they may also ignore false negatives. As a result, users are still unaware that false negatives in their essays are not correct. Therefore, they will neither improve their language ability, nor learn the limitation of the tools.

We suggest that a better way of using language tools is not to rely on them too much. Language-checking tools can be used to eliminate simple spelling and grammar mistakes, but additional checking and revision by hand remains necessary.

Our study also has some limitations. We did not measure the language abilities of our participants. However, we asked each participant to edit two essays and compared the two, which minimized the effect of language ability on our results. Another limitation is a confounding variable in the grading of the essays: they were graded by one of the authors, who is a non-native speaker. To address this, the grader did not know which essays had been edited with the language-checking tools on and which with them off. Additionally, the grader has over twenty years of experience with the English language.
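
The within-subject comparison described above can be illustrated with a small sketch. It assumes one grade per participant per condition and uses a paired t statistic, which is only one plausible way to analyze such data and not necessarily the test used in this study. The scores and the function name paired_t_statistic are hypothetical and are not taken from our data.

```python
import math
import statistics


def paired_t_statistic(with_tool, without_tool):
    """Return (mean difference, t statistic) for paired per-participant scores.

    Each participant contributes one score per condition, so differences are
    taken within a participant and overall language ability cancels out.
    """
    if len(with_tool) != len(without_tool):
        raise ValueError("each participant needs one score per condition")
    diffs = [a - b for a, b in zip(with_tool, without_tool)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two participants are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, t


# Hypothetical essay grades for six participants (illustration only).
scores_tools_on = [72, 65, 80, 58, 77, 69]
scores_tools_off = [70, 66, 78, 59, 75, 70]

mean_diff, t = paired_t_statistic(scores_tools_on, scores_tools_off)
print(f"mean difference = {mean_diff:.2f}, t = {t:.3f}, df = {len(scores_tools_on) - 1}")
```

Because the differences are computed within each participant, a participant with weak English contributes a low score to both conditions, and only the change between conditions enters the statistic.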

Interview Questions
To obtain deeper insight into how non-native speakers think about language-checking tools, we conducted a semi-structured interview after the essay-editing task. The questions were as follows:

  • Do you use language-checking tools? [yes/no] How often do you use them?
  • Where do you usually apply language-checking tools, on emails, essays or IM messages?
  • Do you trust language-checking tools?
  • What do you think of language-checking tools?
  • Do you do extra checking or modifications after you apply language-checking tools?
  • Do you have any suggestions for further improvement on language-checking tools?

Frequency of Usage
As expected, almost all participants use language-checking tools very frequently. This is due to the high availability, good level of integration, ease of use, and real-time operation of current language-checking tools. Most participants said that they would use language-checking tools as long as the tools are built into the editors they are using, and most commented positively on the accessibility of the tools. As quoted from P2:
“I almost use it on everything I write. I do not need to do copy and paste anyway. Today everything has a checker: Word has it, Hotmail has it, Gmail has it, oh, and Latex. So I just leave it open and it will do everything for me. All I need to do is my writing, and those things do not bother at all”
Only P4 expressed that he did not always use language-checking tools:
“I only use them on papers and important emails. You do not really want to control C and control V all your writing in word, too much trouble”.

Trust
Surprisingly, participants expressed different levels of trust in language-checking tools. P6 was the only person who showed complete trust in them. Most considered language-checking tools to be something “trustable, but cannot totally rely on it”. Others said that they trusted checkers a lot but would still double-check their writing. For example, as quoted from P1:
“Yes, I did it on almost every paper I wrote, as well as emails for serious purpose.”
P4 also expressed the same idea:
“I almost do hand checking every time.”
Some participants explained why they do not trust the tools completely.
P5 described the limitations of current tools:
“It is a good tool to improve regular words and basic grammatical structures. It is not always correct and does not catch up certain words within the context”
Many participants expressed concerns about grammar checkers in particular. They thought that spelling checkers perform better than grammar checkers. For example, as quoted from P3:
“Mostly, I think spelling check is very useful, grammar checkers do not do well, and they have problems with special words, for example, they always flag my name.”
P4 also expressed a similar opinion:
“I absolutely trust the spelling checkers. I trust the grammar checkers when they deal with singular/plural mistakes, but I think the grammar checkers can only address certain mistakes.”

Suggestions
We also asked our participants for suggestions on how to improve current language-checking tools. Most suggestions concerned the algorithms for scanning and checking, e.g., improving accuracy and handling special words better. P3 offered an interesting suggestion from a non-native speaker's perspective:
“It would be nice if they could provide the translations of the suggestions. Sometimes, I do not understand the words listed in the suggestions. If they could give translations, I can easier select the correct answer.”

CONCLUSIONS
In this paper, we first conducted a quantitative study to reveal the effect of language-checking tools on the document quality of non-native speakers. Our results showed that such tools may lead users to focus only on the errors that the tools flag. We found that language-checking tools did not significantly improve the quality of our participants' documents. Moreover, in contrast to Galletta's study, we found that language-checking tools did not diminish the performance of our participants.

We also performed a qualitative study to obtain deeper insight into how non-native speakers use language-checking tools. We found that, although language-checking tools are far from perfect, users place a high degree of trust in them.

REFERENCES
1. Connors, R.J. and Lunsford, A.A. Frequency of Formal Errors in Current College Writing, or Ma and Pa Kettle Do Research. The St. Martin’s Guide to Teaching Writing, 2nd ed. Robert Connors and Cheryl Glenn, Eds. St. Martin’s, New York, NY, 1992.
2. Galletta, D.F., Durcikova, A., Everard, A., and Jones, B.M. Does Spell-checking Software Need a Warning Label? Communications of the ACM 48, 7 (July 2005).
3. Kies, D. Evaluating grammar checkers in modern English grammar. Available at: http://papyr.com/hypertextbooks/grammar/gramchek.htm.
4. Knutsson, O., Pargman, T., and Eklundh, K. Transforming grammar checking technology into a learning environment for second language writing. Proceedings of the HLT-NAACL 03 workshop on Building educational applications using natural language processing-Volume 2 (2003), 38–45.
5. Knutsson, O., Pargman, T., Eklundh, K., and Westlund, S. Designing and developing a language environment for second language writers. Computers & Education 49, 4 (2007), 1122–1146.
6. Park, T., Lank, E., Poupart, P., and Terry, M. Is the sky pure today? AwkChecker: an assistive tool for detecting and correcting collocation errors. UIST 2008, 121–130.