A late night quest for automatic language translation

Introduction:

 

Several weeks ago I had a cool idea that turned out to be a programming challenge (it will probably be the subject of my next blog post, once I finish toying around with it long enough to share it with you). Sometimes I download a subtitle for a movie which is out of sync with the version I have. However, finding an English subtitle synced to a given version of the movie is usually very easy, so why not create a program which can sync my subtitle by correlating it with the English subtitle?

 

The first thing you have to get is a good translation for the out-of-sync subtitle which you already have. This sounds easy enough – just go to Google translate and paste, pick “translate to English” and copy it out, right? Well – that works, but what if I want to do it automatically from the program code?

I remembered that I had used the Google Translate API several years back, so naturally I dug up my old code and got to work, but I quickly realized it no longer functions. After that, I tried several other open source GTranslate APIs for C# projects, before finally stumbling upon the fact that GTranslate is now a paid-only subscription service…

 

Searching the web for an alternative came up with one other option – Microsoft (Bing) translation, which has Hebrew translation. If yours is a more common language, you have several more options, but seeing that Hebrew is usually the last language to be supported by any language engine (OCR translation, text to speech etc.), my choice was limited to the biggest players.

 

Microsoft allows the translation of 2 million characters per month with the free subscription. In order to get such a subscription, you need to:

This last step took me a while since it isn’t mentioned anywhere, and since the link to it is placed in fine print hidden amongst many others at the bottom of your account info page.

footer.png

Provide any ClientID and name, redirect URI can be “google.com” or anything else you may think of that can pass for a URI

regApp.png

 

 

Various APIs, no good examples:

 

After subscribing, you see that there are many online examples of how to use Microsoft’s translation API. However, many of the examples are obsolete, as the service has undergone some changes in what I imagine to be the past year or so.

 

 

Since processing one line at a time is incredibly time consuming, and I can’t imagine many applications which need a “one line at a time” translation interface (mine sure doesn’t), I needed something else.

 

First I tried to work around it by trying different end-of-line delimiters, but all of them got changed or removed by the translation engine. Giving up on this approach, I turned to http://api.microsofttranslator.com/V2/HTTP.svc, which is the old REST service for the translation API. This service is still usable, as there are recent examples of its use with the new authorization scheme, and, fortunately, it still contains a TranslateArray method for batch translating multiple lines.

 

After combining several examples into my code, and optimizing the xml created (in the MS example each line contained the full xml namespace), here is a working example for using Microsoft translation:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.Serialization;
using System.Web;
using System.Net;
using System.Threading;
using System.Runtime.Serialization.Json;
using System.IO;
using System.Xml.Linq;

namespace SyncSubsByComparison
{
    public class BingTranslator 
    {
      
        static string _clientId = "<YourIdHere>";
        static string _clientSecret = "<YourSecretHere>";
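        AdmAuthentication admAuth;

        // Read the token through AdmAuthentication on every use, so a token renewed by its timer
        // is picked up instead of a copy that goes stale after 10 minutes.
        AdmAccessToken admToken
        {
            get { return admAuth.GetAccessToken(); }
        }

        public BingTranslator()
        {
            admAuth = new AdmAuthentication(_clientId, _clientSecret);
        }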

      
        /// <summary>
        /// Translates lines with the TranslateArray method, in batches of 250 lines each, so as not to exceed the request size limit.
        /// </summary>
        /// <param name="translateArraySourceTexts"></param>
        /// <param name="toLang"></param>
        /// <param name="fromLang"></param>
        /// <returns></returns>
        public IEnumerable<string> TranslateLines(IEnumerable<string> translateArraySourceTexts, string toLang, string fromLang = null)
        {
            // Wrap each line in an <ar:string> element; the encoding escapes &, < and > so the XML stays well formed.
            var translationLines = translateArraySourceTexts.Select(l => "<ar:string>" + System.Web.HttpUtility.HtmlEncode(l) + "</ar:string>");

            var translationGroups = translationLines
            .Select((line, index) => new { line, index })
            .GroupBy(g => g.index / 250, i => i.line);

            List<string> batches = new List<string>();
            foreach (var group in translationGroups)
            {
                batches.Add(group.Aggregate((a, b) => a + "\n" + b));
            }
            List<string> translatedLines = new List<string>();

            string authToken = "Bearer " + admToken.access_token;

            foreach (string batch in batches)
            {

                string uri = "http://api.microsofttranslator.com/v2/Http.svc/TranslateArray";
                string body = "<TranslateArrayRequest>\n" +
                                 "<AppId />\n" +
                                 "<From>{0}</From>\n" +
                                 "<Options xmlns:trans=\"http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2\">\n" +
                                    " <trans:Category  />\n" +
                                     "<trans:ContentType>{1}</trans:ContentType>\n" +
                                     "<trans:ReservedFlags />\n" +
                                     "<trans:State />\n" +
                                     "<trans:Uri />\n" +
                                     "<trans:User />\n" +
                                 "</Options>\n" +
                                 "<Texts xmlns:ar=\"http://schemas.microsoft.com/2003/10/Serialization/Arrays\">\n" +
                                    batch +
                                 "\n</Texts>\n" +
                                 "<To>{2}</To>\n" +
                              "</TranslateArrayRequest>";
                string reqBody = string.Format(body, fromLang, "text/plain", toLang);
                // create the request
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
                request.Headers.Add("Authorization", authToken);
                request.ContentType = "text/xml";
                request.Method = "POST";

                using (System.IO.Stream stream = request.GetRequestStream())
                {
                    byte[] arrBytes = System.Text.Encoding.UTF8.GetBytes(reqBody);
                    stream.Write(arrBytes, 0, arrBytes.Length);
                }

                // Get the response
                WebResponse response = null;
                try
                {
                    response = request.GetResponse();
                    using (Stream stream = response.GetResponseStream())
                    {
                        using (StreamReader rdr = new StreamReader(stream, System.Text.Encoding.UTF8))
                        {
                            // Deserialize the response
                            string strResponse = rdr.ReadToEnd();
                            Console.WriteLine("Result of translate array method is:");
                            XDocument doc = XDocument.Parse(strResponse);
                            XNamespace ns = "http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2";
                            // Each TranslateArrayResponse element carries the translation of one input line.
                            foreach (XElement xe in doc.Descendants(ns + "TranslateArrayResponse"))
                            {

                                foreach (var node in xe.Elements(ns + "TranslatedText"))
                                {
                                    translatedLines.Add(node.Value);
                                }
                            
                            }
                        }
                    }
                    
                }
                catch (WebException e)
                {
                    ProcessWebException(e);
                    return null;
                }
                finally
                {
                    if (response != null)
                    {
                        response.Close();
                        response = null;
                    }
                }
                
            }

            return translatedLines;
        }

        public string DetectLanguage(string text)
        {
            string headerValue;
            
            try
            {
                // Create a header with the access_token property of the returned token
                headerValue = "Bearer " + admToken.access_token;
                return DetectLangInternal(headerValue, text);
            }
            catch (WebException e)
            {
                ProcessWebException(e);
            }
            return null;
        }

        private static string DetectLangInternal(string authToken, string textToDetect)
        {
            //Keep appId parameter blank as we are sending access token in authorization header.
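            // URL-encode the text so spaces, '&' and non-ASCII characters survive the query string.
            string uri = "http://api.microsofttranslator.com/v2/Http.svc/Detect?text=" + HttpUtility.UrlEncode(textToDetect);
            HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(uri);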
            httpWebRequest.Headers.Add("Authorization", authToken);
            WebResponse response = null;
            string languageDetected;
            try
            {
                response = httpWebRequest.GetResponse();
                using (Stream stream = response.GetResponseStream())
                {
                    System.Runtime.Serialization.DataContractSerializer dcs = new System.Runtime.Serialization.DataContractSerializer(typeof(string));
                    languageDetected = (string)dcs.ReadObject(stream);
                }

            }

            finally
            {
                if (response != null)
                {
                    response.Close();
                    response = null;
                }
            }
            return languageDetected;
        }

        private static void ProcessWebException(WebException e)
        {
            Console.WriteLine("{0}", e.ToString());
            // Obtain detailed error information
            string strResponse = string.Empty;
            // e.Response is null when the failure happened before any HTTP response arrived (e.g. DNS failure or timeout).
            if (e.Response != null)
            {
                using (HttpWebResponse response = (HttpWebResponse)e.Response)
                using (Stream responseStream = response.GetResponseStream())
                using (StreamReader sr = new StreamReader(responseStream, System.Text.Encoding.UTF8))
                {
                    strResponse = sr.ReadToEnd();
                }
            }
            Console.WriteLine("Web exception status={0}, error message={1}", e.Status, strResponse);
        }

        public bool IsOperational
        {
            get { return !string.IsNullOrWhiteSpace(_clientId) && !string.IsNullOrWhiteSpace(_clientSecret); }
        }
    }
 
    [DataContract]
    public class AdmAccessToken
    {
        [DataMember]
        public string access_token { get; set; }
        [DataMember]
        public string token_type { get; set; }
        [DataMember]
        public string expires_in { get; set; }
        [DataMember]
        public string scope { get; set; }
    }

    public class AdmAuthentication
    {
        public static readonly string DatamarketAccessUri = "https://datamarket.accesscontrol.windows.net/v2/OAuth2-13";
        private string clientId = "";
        private string clientSecret = "";
        private string request;
        private AdmAccessToken token;
        private Timer accessTokenRenewer;

        //Access token expires every 10 minutes. Renew it every 9 minutes only.
        private const int RefreshTokenDuration = 9;

        public AdmAuthentication(string clientId, string clientSecret)
        {
            this.clientId = clientId;
            this.clientSecret = clientSecret;
            //If clientid or client secret has special characters, encode before sending request
            this.request = string.Format("grant_type=client_credentials&client_id={0}&client_secret={1}&scope=http://api.microsofttranslator.com", HttpUtility.UrlEncode(clientId), HttpUtility.UrlEncode(clientSecret));
            this.token = HttpPost(DatamarketAccessUri, this.request);
            //renew the token every specified minutes
            accessTokenRenewer = new Timer(new TimerCallback(OnTokenExpiredCallback), this, TimeSpan.FromMinutes(RefreshTokenDuration), TimeSpan.FromMilliseconds(-1));
        }

        public AdmAccessToken GetAccessToken()
        {
            return this.token;
        }


        private void RenewAccessToken()
        {
            AdmAccessToken newAccessToken = HttpPost(DatamarketAccessUri, this.request);
            //swap the new token with old one
            //Note: the swap is thread unsafe
            this.token = newAccessToken;
            Console.WriteLine(string.Format("Renewed token for user: {0} is: {1}", this.clientId, this.token.access_token));
        }

        private void OnTokenExpiredCallback(object stateInfo)
        {
            try
            {
                RenewAccessToken();
            }
            catch (Exception ex)
            {
                Console.WriteLine(string.Format("Failed renewing access token. Details: {0}", ex.Message));
            }
            finally
            {
                try
                {
                    accessTokenRenewer.Change(TimeSpan.FromMinutes(RefreshTokenDuration), TimeSpan.FromMilliseconds(-1));
                }
                catch (Exception ex)
                {
                    Console.WriteLine(string.Format("Failed to reschedule the timer to renew access token. Details: {0}", ex.Message));
                }
            }
        }


        private AdmAccessToken HttpPost(string DatamarketAccessUri, string requestDetails)
        {
            //Prepare OAuth request 
            WebRequest webRequest = WebRequest.Create(DatamarketAccessUri);
            webRequest.ContentType = "application/x-www-form-urlencoded";
            webRequest.Method = "POST";
            byte[] bytes = Encoding.ASCII.GetBytes(requestDetails);
            webRequest.ContentLength = bytes.Length;
            using (Stream outputStream = webRequest.GetRequestStream())
            {
                outputStream.Write(bytes, 0, bytes.Length);
            }
            using (WebResponse webResponse = webRequest.GetResponse())
            {
                DataContractJsonSerializer serializer = new DataContractJsonSerializer(typeof(AdmAccessToken));
                //Get deserialized object from JSON stream
                AdmAccessToken token = (AdmAccessToken)serializer.ReadObject(webResponse.GetResponseStream());
                return token;
            }
        }
    }

}
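
As a sketch of an alternative to concatenating the request by hand: System.Xml.Linq can build the same TranslateArrayRequest body, declaring each namespace prefix once on the parent element and escaping the text automatically. The helper name TranslateArrayBody.Build is hypothetical (it is not part of any Microsoft SDK); the element names and namespace URIs are copied from the listing above, and this sketch has not been run against the live service.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TranslateArrayBody
{
    // Namespace URIs copied from the request template in the listing above.
    static readonly XNamespace Trans = "http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2";
    static readonly XNamespace Ar = "http://schemas.microsoft.com/2003/10/Serialization/Arrays";

    // Hypothetical helper: returns the XML request body for one batch of lines.
    public static string Build(IEnumerable<string> lines, string toLang, string fromLang)
    {
        var root = new XElement("TranslateArrayRequest",
            new XElement("AppId"),
            new XElement("From", fromLang),
            new XElement("Options",
                // Declare the "trans" prefix once; child elements reuse it.
                new XAttribute(XNamespace.Xmlns + "trans", Trans.NamespaceName),
                new XElement(Trans + "Category"),
                new XElement(Trans + "ContentType", "text/plain"),
                new XElement(Trans + "ReservedFlags"),
                new XElement(Trans + "State"),
                new XElement(Trans + "Uri"),
                new XElement(Trans + "User")),
            new XElement("Texts",
                // Declare the "ar" prefix once for all <ar:string> children.
                new XAttribute(XNamespace.Xmlns + "ar", Ar.NamespaceName),
                // XElement escapes &, < and > in the text content on output.
                lines.Select(l => new XElement(Ar + "string", l))),
            new XElement("To", toLang));

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

With this approach the string returned by Build would take the place of reqBody in TranslateLines, and the explicit HtmlEncode call would no longer be needed, since the text is escaped when the XML is written.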

 

Points of interest:

  1. There’s a limit to the length of the string you can send to the translator service, so I used groups of 250 lines. Partitioning is achieved easily with a GroupBy LINQ expression. (Subtitle lines tend to be short; you might want to lower this number if your lines are longer.)
  2. The security token should be renewed automatically every 9 minutes (as it times out at 10 minutes). I didn’t check whether this works correctly, but it’s MS code taken from the example.
  3. The code above includes authentication token creation classes and a language detection function as well.
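  4. HTML encoding is applied to the lines before they are sent to the translator. The lines are embedded in an XML request body, so characters such as & and < would otherwise break the XML, and many languages contain characters that might create problems while passing through the communication channels.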
  5. I do have one good thing to say about Microsoft’s translation engine: After getting this up and running, I discovered that the translation engine gives good results which are sometimes better than Google Translate. (Good work on the engine, Microsoft, less so on the API.)
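
The partitioning in point 1 counts lines only. Below is a minimal sketch of a variant that also caps the number of characters per batch. SubtitleBatcher.Split is a hypothetical helper, and the limits passed to it are values you choose yourself, not documented service limits.

```csharp
using System;
using System.Collections.Generic;

public static class SubtitleBatcher
{
    // Splits lines into batches holding at most maxLines lines and, where possible,
    // at most maxChars characters. A single line longer than maxChars gets a batch of its own.
    public static List<List<string>> Split(IEnumerable<string> lines, int maxLines, int maxChars)
    {
        var batches = new List<List<string>>();
        var current = new List<string>();
        int currentChars = 0;

        foreach (string line in lines)
        {
            // Would adding this line break either limit?
            bool full = current.Count >= maxLines || currentChars + line.Length > maxChars;
            if (full && current.Count > 0)
            {
                batches.Add(current);
                current = new List<string>();
                currentChars = 0;
            }
            current.Add(line);
            currentChars += line.Length;
        }

        // Flush the last, partially filled batch.
        if (current.Count > 0)
        {
            batches.Add(current);
        }
        return batches;
    }
}
```

For example, SubtitleBatcher.Split(lines, 250, 8000) keeps the 250-line grouping used in the code above while starting a new batch early when the lines are long; the 8000 here is only an illustrative number.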

Conclusion:

Automatic translation can be a starting point for many applications; it can serve academic research as well as corporate software. Getting this to work was hard, much harder than it would have been had a decent service or example existed online. (I hope this post helps with the latter.)

 

I think Microsoft should really make their translation APIs more approachable, as I can’t imagine that making life harder for beginners will land them a lot of paying customers. I must also wonder at Google’s decision to make the translation API a paid-only service with no free tier. This looks out of line with their policy of supplying a rather generous free tier for most of their other products, and it may discourage general testing and adoption of this API, which is not as well known as Google’s search services.

 

As for my subtitle syncing application, a simple dictionary translation might have been enough, but I went for what I thought was the fastest option available, and stumbled into this adventure on the way.

 

The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation.