Introduction

The Mean Opinion Score (MOS) is a popular subjective quality measure for audio, video and images. It is typically used to compare the performance of multimedia signal processing algorithms, and is most often expressed as a number from 1 to 5. The method of obtaining the MOS for a specific test condition (e.g., for an audio codec at a certain bitrate) consists of having a sufficiently large group of volunteers rate media samples from a dataset designed to be diverse enough to characterize the test condition. The MOS is then given by the average score.
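
As a minimal illustration (this is not the exact statistical processing performed by the crowdMOS scripts, which is described in the paper referenced below), the MOS and a normal-approximation 95% confidence interval for one test condition could be computed from a set of ratings as follows:

    # Illustrative sketch only: MOS and a normal-approximation 95% confidence
    # interval for one test condition, given hypothetical 1-5 ratings.
    ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4]
    n = ratings.length
    mos = ratings.inject(0.0) { |sum, r| sum + r } / n
    variance = ratings.inject(0.0) { |sum, r| sum + (r - mos) ** 2 } / (n - 1)
    ci95 = 1.96 * Math.sqrt(variance / n)
    puts "MOS = %.2f +/- %.2f" % [mos, ci95]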

Proper MOS studies are performed under laboratory conditions with pre-screened volunteers. This process can be expensive and time consuming, so many researchers choose instead to run informal subjective studies or to use objective quality measures. To obtain a subjective quality measure designed to approximate MOS at a significantly lower cost, we propose crowdsourcing MOS studies to produce a measure we call crowdMOS (for crowdsourced MOS). Crowdsourcing consists of outsourcing a task to a large group of people (a crowd), typically using the Internet as a recruiting tool. In this case, we use Amazon Mechanical Turk, an existing crowdsourcing platform, to have hundreds or thousands of workers give subjective scores to arbitrary media files at their leisure, without leaving their homes or offices, and using their own computers. We collect these results, screen them to remove clearly inaccurate submissions, and process the remaining data to produce MOS scores in a statistically sound way.

Mechanical Turk (MTurk) tasks are called HITs (Human Intelligence Tasks), and the platform can be used for many things other than crowdMOS. This flexibility gives MTurk a significant learning curve for researchers who are only interested in MOS, which is why we have written this tool. The scripts contained in this package provide a simplified way of automatically generating crowdMOS HITs, collecting the results submitted by workers, and analyzing them.

To perform a crowdMOS user study using MTurk, you will need:

  1. a set of files to be rated from 1 to 5
  2. an HTML file giving instructions on how to rate your files
  3. a set of XML templates describing how each task will be presented to the workers on MTurk
  4. configuration parameters for your HITs, such as the payment per submitted HIT, expiration date, etc.

The scripts provided with this package significantly simplify crowdMOS tests by:

  1. automatically generating and submitting tasks from the inputs above, which are easy to prepare given the examples provided
  2. collecting the data from submitted HITs by using the MTurk API
  3. approving or rejecting HITs based on their consistency, and awarding bonuses
  4. determining mean scores and confidence intervals.

These tools can be used free of charge. If you wish to use crowdMOS in your research, we kindly ask that you reference its website and the paper below:

F. Ribeiro, D. Florencio, C. Zhang, and M. Seltzer, "crowdMOS: An Approach for Crowdsourcing Mean Opinion Score Studies," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2011.

Installation

1. Sign up for an Amazon Mechanical Turk account at http://www.mturk.com.

2. a) If using Windows

Download and install the MRI Ruby installer version 1.8.7-p302 from http://rubyinstaller.org (do not use newer versions).
Make sure to check "Add Ruby executables to your PATH" and "Associate .rb and .rbw files with this Ruby installation".

b) If using UNIX-like systems

UNIX-like systems can also run MRI Ruby, but installation instructions are distribution dependent.
For example, on Fedora Linux one can install MRI Ruby and Ruby Gems with "yum install ruby rubygems".

3. All interactive scripts should be run in a command prompt. Several scripts produce output with more than the standard 80 columns, so if you're using Windows, we recommend you configure your command prompt to be 120 columns wide. You can do this by launching the prompt, right-clicking on the title bar, accessing Properties -> Layout and setting the Window Size Width to 120.

4. (optional) If you're behind an HTTP proxy with hostname 'p_host' and port 'p_port', run
    set http_proxy=http://p_host:p_port (on Windows)
    export http_proxy="http://p_host:p_port" (on UNIX-like systems)
    (Don't forget the http:// prefix)

5. Install the Ruby gems which the crowdMOS tools depend on, with

    gem install ruby-aws
    gem install facets

6. Run GetRequesterStatistics.rb. This script has no side effects (it won't create, delete or update anything in your MTurk account), but it will let you set up your authentication information and verify that the steps above were completed successfully.

You will be prompted for your Amazon Web Services Access Key ID and Secret Access Key. To obtain these, visit http://aws.amazon.com, click on the Account tab and then on the "Security Credentials" link.

GetRequesterStatistics.rb should print out an output resembling the following:

    |  1 | Assignments Available    |      0 |
    |  2 | Assignments Accepted     |      0 |
    |  3 | Assignments Pending      |      0 |
    |  4 | Assignments Approved     |      0 |
    |  5 | Assignments Approved (%) |  100.0 |
    |  6 | Assignments Rejected     |      0 |
    |  7 | Assignments Rejected (%) |    0.0 |
    |  8 | Assignments Returned     |      0 |
    |  9 | Assignments Abandoned    |      0 |
    | 10 | Total Reward Payout      |   0.00 |
    | 11 | Average Reward Amount    |   0.00 |
    | 12 | Total Reward Fee Payout  |   0.00 |
    | 13 | Total Bonus Payout       |   0.00 |
    | 14 | Total Bonus Fee Payout   |   0.00 |
    | 15 | HITs Created             |      0 |
    | 16 | HITs Completed           |      0 |
    | 17 | HITs Assignable          |      0 |
    | 18 | HITs Reviewable          |      0 |
    | 19 | Est. Reward Liability    |      0 |
    | 20 | Est. Fee Liability       |      0 |
    | 21 | Est. Total Liability     |      0 |

If you see a table similar to the one above, then your installation was successful. Otherwise, the error messages should indicate what went wrong.

Package contents

This package is composed of the following scripts:

AnalyzeAssignments.rb:
Analyzes assignments (submitted HITs) collected with GetAssignments.rb, and prints mean scores, confidence intervals and other statistics. This script is also used to automatically approve or reject assignments.

CreateHITs.rb:
Creates and submits HITs to MTurk, using the templates described later in these instructions.

ExpireDeleteHITs.rb:
Expires active HITs and/or deletes HITs whose assignments have already been approved or rejected.

ExtendHITs.rb:
Extends the lifetime of HITs, or the maximum number of assignments associated with each HIT.

GetAssignments.rb:
Retrieves completed assignments.

GetRequesterStatistics.rb:
Retrieves requester statistics for this MTurk account.

GrantBonuses.rb:
Analyzes the retrieved assignments and grants bonuses based on performance.

MergeResults.rb:
Merges two or more result files, which must have the tab-delimited format of MOS.hit_results (which is created automatically, and is not meant to be edited manually).

ResetLocalState.rb:
Resets the local state, destroying all bookkeeping. This script deletes all retrieved assignments and records of created HITs, so only use it at the end of an experiment.

WorkerAccess.rb:
Blocks or unblocks a worker from accepting your HITs. Use only if a worker has submitted obviously bad results, because blocks can lead to workers being banned by Amazon.

Usage information can be obtained for each script by invoking "<script-name> --help" on the command prompt.
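
For example, to list the switches accepted by CreateHITs.rb (assuming the Ruby executable is on your PATH):

    ruby CreateHITs.rb --help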

User interfaces

HITs can either be designed using MTurk's form templates, or using an external HTML file.

Using MTurk's form templates is easier because they don't require any programming. These templates are written in XML, and describe simple forms with only a few types of input elements, which the worker fills out before submitting results. We use this approach for audio quality assessment, because all we need for each file is one audio player followed by radio buttons corresponding to the 1-5 scores.

On the other hand, multiple-stimulus image quality assessment can only be performed accurately if the worker can compare overlaid versions of the same image. This level of interactivity is not possible with MTurk's templates, so we implement it with dynamic HTML, where clicking a button replaces one image with another, and use an external HTML file for the HIT. This external HTML file must be hosted on your own server, along with the files which will be scored.

The following section explains how to use the form templates and external HTML files which are packaged with crowdMOS (so you don't have to write your own from scratch).

Typical usage

1) Reset your local configuration files by running ResetLocalState.rb. This will also initialize your config.txt file, if it does not exist.
WARNING: don't run ResetLocalState.rb if you are in the middle of a study, because it will delete all book-keeping information, including HITs submitted by workers and bonus records.

2) Collect a list of files which you would like rated on a MOS scale. We assume that the purpose of the MOS test is to compare two or more techniques designed to perform the same task (speech enhancement, dereverberation, text-to-speech, etc.). To ensure a fair comparison, you must compare methods across the same input files (this is a hard requirement, and CreateHITs.rb will refuse to run otherwise).

For audio quality assessment, our templates use the WordPress Audio Player, which only plays MP3 files. Thus, you should convert your files to MP3 at a bitrate high enough that MP3 compression artifacts are inaudible. Upload the MP3 files to a webserver.
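
One possible way to do this conversion (the LAME encoder is not part of this package; any MP3 encoder at a sufficiently high bitrate will do) is:

    lame -b 256 input.wav output.mp3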

Edit MOS.hit_input, which is a tab-delimited text file. Each line describes a unique file to be scored, and has the following format:

<sentence-name> <TAB> <algorithm> <TAB> <URL>

where:

<sentence-name> identifies the unprocessed file which was used in combination with an algorithm to generate the file to be scored
<algorithm> identifies the method used to generate the file
<URL> is:
(i) the URL where the file can be downloaded, if using the MTurk form templates
(ii) the filename of the file to be scored, if using an external HIT. In this case, all files to be scored must be in the same webserver folder as the external HIT HTML.

The <sentence-name> and <algorithm> fields will not be shown to the MTurk workers, and they should contain a description which is meaningful to you (the researcher). No two lines should have the same (sentence-name, algorithm) pair.
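
As an illustration, a hypothetical MOS.hit_input comparing two algorithms (here called "wiener" and "proposed") on two sentences hosted on a hypothetical server, using the MTurk form templates, could look like this (fields separated by single TAB characters):

    sentence01	wiener	http://example.com/mos/sentence01_wiener.mp3
    sentence01	proposed	http://example.com/mos/sentence01_proposed.mp3
    sentence02	wiener	http://example.com/mos/sentence02_wiener.mp3
    sentence02	proposed	http://example.com/mos/sentence02_proposed.mp3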

3) Edit MOS.hit_properties and change the descriptions, payments and qualifications to match your project.

If you're using the MTurk form templates (used exclusively for audio quality tests)

Edit the MOS.hit_question_* files to match your project. The CreateHITs.rb script will use MOS.hit_question_header, MOS.hit_question_template and MOS.hit_question_footer to generate HITs, by instantiating fields with the format <%= @variable %> contained in these files. When CreateHITs.rb is run, it will ask you (the researcher) how many files should be scored per HIT, which we call the SamplesPerHIT parameter. The HIT template which is effectively uploaded to MTurk is a QuestionForm XML file, and is generated automatically by concatenating the contents of the MOS.hit_question_* files like so:

    MOS.hit_question_header
    MOS.hit_question_template
    	... (MOS.hit_question_template is repeated SamplesPerHIT times)
    MOS.hit_question_template
    MOS.hit_question_footer

For each HIT, exactly SamplesPerHIT files are selected at random from the collection described in MOS.hit_input, subject to the design constraints specified to CreateHITs.rb with command-line switches. The <%= @URL %> field contained in MOS.hit_question_template is instantiated with each file's URL. The <%= @index %> field is instantiated with an integer from 1 to SamplesPerHIT, and identifies the question number. The <%= @identifier %> field is instantiated with an automatically generated string which uniquely identifies each file. It is used internally by the MTurk framework and by these scripts as a Question Identifier, and must not be changed or renamed, as it is used for book-keeping purposes.
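
To give an idea of how these fields appear, here is a simplified, hypothetical fragment in the spirit of MOS.hit_question_template (the file shipped with this package is more elaborate, e.g. it embeds the audio player; refer to it for the real structure):

    <Question>
      <QuestionIdentifier><%= @identifier %></QuestionIdentifier>
      <QuestionContent>
        <Text>File <%= @index %>: listen to the sample at <%= @URL %> and rate its quality.</Text>
      </QuestionContent>
      <AnswerSpecification>
        <SelectionAnswer>
          <Selections>
            <Selection>
              <SelectionIdentifier>5</SelectionIdentifier>
              <Text>5 - Excellent</Text>
            </Selection>
            <!-- ... selections for scores 4 down to 1 ... -->
          </Selections>
        </SelectionAnswer>
      </AnswerSpecification>
    </Question>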

If you're using an external HIT HTML (used for image quality tests)

Edit the external_HIT.html and instructions.html files, and change the instructions to suit your own requirements. You can ignore the Javascript sections.

4) The config.txt file has several parameters which control how the scripts work. Some of them are user-configurable, and are described below.

Bonuses

To motivate workers to work on as many HITs as possible and to pay attention, we use a bonus structure that rewards quality and quantity.

To measure quantity, we look at the number of HITs a worker has submitted for a given MOS study. Workers who submit at least 'BonusMinAssignments' HITs are rewarded with 'BonusForQuantity' USD per submitted HIT.

To measure quality, we correlate a worker's submitted scores with the mean scores. We rank workers by their correlation coefficients (where a higher correlation with the mean is better). Workers whose scores are in the top 50% are rewarded with an additional 'BonusForQualityTop50Pct' USD per submitted HIT. Workers whose scores are in the top 10% are rewarded with an additional 'BonusForQualityTop10Pct' USD per submitted HIT. (Note that all bonuses are cumulative.)

IMPORTANT: before running an experiment, always compute the expected bonus payout. If these bonus amounts aren't properly chosen, their cost to you (the requester) can significantly exceed the payout for the HITs themselves. The MTurk workers will contact you asking for promised bonuses if you are late paying them, and if you do not pay them, your reputation as a requester will be compromised. MTurk workers have forums where they warn each other of non-paying requesters, and it is important for a requester to maintain a good reputation in order to get good throughput.
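
For example (hypothetical values): with BonusForQuantity = 0.05, BonusForQualityTop50Pct = 0.05, BonusForQualityTop10Pct = 0.10 and BonusMinAssignments = 10, a worker ranked in the top 10% who submits 40 HITs would receive

    40 * (0.05 + 0.05 + 0.10) = 8.00 USD

in bonuses, on top of the base payment for those HITs. In the worst case every submitted HIT earns all three bonuses, so a study with 500 submitted HITs could incur up to 500 * 0.20 = 100.00 USD in bonus payouts alone, plus Amazon's bonus fees.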

Other Parameters

MinSampleWorkingTime: the minimum time (in seconds) a worker needs to properly score one media file (for audio, this should exceed the sample's duration). Assignments submitted in less than (SamplesPerHIT * MinSampleWorkingTime) seconds will be ignored by the scripts; see the example below.
RequesterName: the name used to sign automatically generated e-mails to workers (for example, e-mails containing bonus reports).
Server: either Sandbox or Production; selects which MTurk server to use. The Sandbox should be used for testing, since it is free to use. Always test using the Sandbox before deploying to the Production server.
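
For example (hypothetical values): if each audio sample is 8 seconds long and SamplesPerHIT is 5, setting MinSampleWorkingTime = 10 means that any assignment submitted in less than 5 * 10 = 50 seconds will be ignored.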

5) Run CreateHITs.rb to create HITs.

You must use the --design switch to specify the test design (see the example invocation after the list below). The options are:

  • random_sentences: similar to the ACR test from ITU-T P.800, where each HIT is created by drawing a user-specified number of files (SamplesPerHIT) from the sample pool, without replacement.
  • diff_sentences: a variation of random_sentences, with the constraint that a HIT never contains two samples created from the same test signal. This constraint enforces the concept that ACR is a single-stimulus test with an absolute scale, where one should not be tempted to make relative judgements.
  • same_sentence: a simple multiple-stimulus test, where each HIT only has samples created from the same test signal. This encourages workers to make relative comparisons, and allows them to discriminate features that would not be noticeable otherwise.
  • mushra: the MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) test from ITU-R BS.1534-1. It is similar to the simple multiple-stimulus test, except that: (i) a reference is presented to the subject before the test, and labeled with a 5.0 score, (ii) this same reference is hidden among the test files and (iii) also hidden among the test files is a version of the reference, degraded with a method decided a priori, called the anchor. The anchor provides a baseline which minimizes distortions of the rating scale due to the use of relative comparisons.
    The references and anchors must be identified using the algorithm names REF and ANC in MOS.hit_input.
  • dsis: the DSIS (double-stimulus impairment scale) test from ITU-R BT.500, where each HIT presents only two samples: an unprocessed reference and a processed version. Both images are labeled as such. The worker must score how much the processing degraded the reference. This is conceptually equivalent to the DCR (degradation category rating) test from ITU-T P.800.
    The references must be identified using the algorithm name REF in MOS.hit_input.
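
For example, a multiple-stimulus audio test could be created with an invocation along the lines of:

    ruby CreateHITs.rb --design same_sentence

The script will then prompt for any remaining information it needs (such as the SamplesPerHIT parameter) before uploading the HITs to the server selected in config.txt.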

6) You can check the progress of your HITs using the MTurk web interface. To retrieve results, use GetAssignments.rb, which can be called repeatedly to update intermediate results.

You can analyze results at any time with AnalyzeAssignments.rb. When run without any switches, it prints at the end how it intends to approve and reject pending assignments. To commit the approvals and rejections to the MTurk server, use the -c switch. Note that these approvals and rejections are affected by parameters from the config.txt file.

You can automatically grant bonuses with GrantBonuses.rb. When run without any switches, it prints the bonus summary e-mail it plans to send to each worker and the total bonus amount (in USD) it will pay out. To commit the bonuses to the MTurk server, use the -c switch.
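
A typical retrieval and review cycle therefore looks something like this (the exact output depends on your study):

    ruby GetAssignments.rb          (retrieve newly submitted assignments)
    ruby AnalyzeAssignments.rb      (dry run: mean scores, confidence intervals, planned approvals/rejections)
    ruby AnalyzeAssignments.rb -c   (commit the approvals and rejections to the MTurk server)
    ruby GrantBonuses.rb            (dry run: planned bonus e-mails and total payout)
    ruby GrantBonuses.rb -c         (commit the bonuses)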

All scripts feature book-keeping, so previously submitted HITs won't be lost and bonuses won't be awarded twice. Sometimes the MTurk server hiccups and doesn't respond correctly to an approval or bonus command. But the scripts keep track of this, and will tell you to re-run the appropriate script to retry the failed commands.

7) To finish a study, you can either wait for all HITs to be submitted, or you can call ExpireDeleteHITs.rb, which expires active HITs.

Like the other scripts, ExpireDeleteHITs.rb only commits expirations when using the -c switch. By default, it only expires HITs; to also delete HITs, add the -d switch. By default, it only acts on HITs which were created after the last local state reset; to act on all the HITs on your MTurk account, use the -a switch. For more details, use --help.
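
For example:

    ruby ExpireDeleteHITs.rb        (dry run: nothing is committed)
    ruby ExpireDeleteHITs.rb -c     (expire active HITs created since the last local state reset)
    ruby ExpireDeleteHITs.rb -c -d  (also delete HITs that have already been approved or rejected)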

Note that ExpireDeleteHITs.rb can only be used to delete HITs that have already been approved or rejected. As of this writing, Amazon keeps expired HITs on its servers for around 100 days, so even if they have been deleted, their HIT-Id will still exist and their data can be retrieved for some time. As long as you do not delete MOS.hit_success (which is updated by CreateHITs.rb), you will be able to access these results.

Contact information

If you have any questions, suggestions or criticism, feel free to contact the authors:

Flavio Ribeiro
Ph.D. Candidate in Electrical Engineering
Signal Processing Laboratory, University of São Paulo
fr@lps.usp.br

Dinei Florêncio
Researcher, Communication and Collaboration Systems Group
Microsoft Research Redmond
dinei@microsoft.com

Cha Zhang
Researcher, Communication and Collaboration Systems Group
Microsoft Research Redmond
chazhang@microsoft.com

Mike Seltzer
Researcher, Speech Technology Group
Microsoft Research Redmond
mseltzer@microsoft.com

