Extracting Instagram Data – Part 1

For the next two posts, I’m going to introduce some of the techniques I’ve been using to mine content on Instagram without using their API. This will include extracting a set of posts for a particular hashtag, extracting user information, and extracting a public user’s timeline. This first post will introduce some of the concepts of the REST service that Instagram’s front-end JS client uses to load data, and start by showing how we can use it to extract posts containing a hashtag.

If you’re just interested in the Python code for this example, it can be found at https://github.com/tomkdickinson/Instagram-Search-API-Python.

Before I go any further, I’d like to point out this is not an exhaustive search of hashtags. It only returns a subset, limited to what Instagram terms “Most Recent”.

This approach looks at using Instagram’s ‘Explore’ search function: https://www.instagram.com/explore/tags/

For example, if we want to explore #food, we can use the query: https://www.instagram.com/explore/tags/food/

When querying this URL, we are greeted with a search page from Instagram showing us the top posts and the most recent ones. If we scroll down, there is a load more button, followed by an infinite scroll displaying more posts. This approach will be mining that infinite scroll.

First, let’s use a cool trick with Instagram. On most pages backed by database data, if you append the query param __a=1, you can get the JSON data for that page.

In our food example https://www.instagram.com/explore/tags/food/?__a=1 will give us the following results:

{
    tag: {
        media: {
            count: 195222596,
            page_info: {...},
            nodes: [...]
        },
        content_advisory: null,
        top_posts: {
            nodes: [...]
        },
        name: "food"
    }
}

Now as you can see, there’s a lot of information there, but it matches up with areas on the page. ‘media’ contains information about the most recent posts, while top_posts has our “Top Posts” list.

We also find that the content of each post is in the “nodes” arrays. A mocked-up example can be found below:

{
    code: "AjfIaiwelA",
    dimensions: {
        width: 1080,
        height: 1080
    },
    comments_disabled: false,
    owner: {
        id: "34578678"
    },
    comments: {
        count: 5
    },
    caption: "caption of photo",
    likes: {
        count: 3
    },
    date: 1480069513,
    thumbnail_src: "http://thumbnail.src",
    is_video: true,
    id: "1391246372820331800",
    display_src: "http://image.src"
}

Given a JSON object like that, we can easily extract the content using something like Python or Groovy, and convert it into a usable object, or just save the document directly into a JSON database like MongoDB.
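For example, here is a minimal sketch in Python using the requests library, assuming the response structure shown above (the fetch_recent_posts name is just for illustration, and of course the endpoint behaviour may change whenever Instagram updates their front end):

import requests

def fetch_recent_posts(tag):
    # Fetch the JSON behind the tag page using the ?__a=1 trick, and pull out
    # the most recent posts plus the paging info that sits alongside them.
    url = "https://www.instagram.com/explore/tags/%s/?__a=1" % tag
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    media = response.json()["tag"]["media"]
    return media["nodes"], media["page_info"]

posts, page_info = fetch_recent_posts("food")
for post in posts:
    print("%s has %d likes" % (post["code"], post["likes"]["count"]))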

Now obviously, not every available post will be in the first query, so next we must deal with paging. If we go back to our ‘media’ object, there is a ‘page_info’ object containing some cursor information. Here we have a useful boolean value, has_next_page, that indicates whether there is a next page, and an end_cursor we can use in our query.
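Continuing the sketch above, that is just two lookups on the page_info object we already pulled out:

# page_info tells us whether there are more posts, and which cursor to request them with
has_next_page = page_info["has_next_page"]
end_cursor = page_info["end_cursor"]
if has_next_page:
    print("More posts available after cursor %s" % end_cursor)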

The next step in reverse engineering a system like this is to have a look at the network requests for when we click the “Load more” button at the end of the page. This triggers a POST query to “https://www.instagram.com/query/”, which is Instagram’s query endpoint.

If we look at the query, we see there are some parameters. Here’s an example:

q: ig_hashtag(food) {
    media.after(J0HWFJqMAAAAF0HWFJqHAAAAFjwA, 10) {
        count,
        nodes {
            caption,
            code,
            comments {
                count
            },
            comments_disabled,
            date,
            dimensions {
                height,
                width
            },
            display_src,
            id,
            is_video,
            likes {
                count
            },
            owner {
                id
            },
            thumbnail_src,
            video_views
        },
        page_info
    }
}
ref: tags::show
query_id:

Interestingly, we see that a good bulk of that query reflects the response we get back in the JSON. In fact, you can even remove some of the parameters in the query to make a smaller request. Something else to point out here is that our end cursor is found in the query’s media.after() call, and we can even bump up the number of posts returned from 10 to a slightly higher number.
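As a sketch, we can build that ‘q’ parameter in Python with some simple string formatting (the build_query name, the default page size, and the trimmed field list are all just for illustration):

def build_query(tag, end_cursor, count=50):
    # Substitute the hashtag, end cursor and page size into the same query
    # structure Instagram's own client sends, keeping only the fields we want.
    return (
        "ig_hashtag(%s) { media.after(%s, %d) { count, "
        "nodes { caption, code, comments { count }, date, "
        "dimensions { height, width }, display_src, id, is_video, "
        "likes { count }, owner { id }, thumbnail_src }, "
        "page_info } }"
    ) % (tag, end_cursor, count)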

Now that we have the query sorted out, let’s take a look at the headers.

accept:*/*
accept-encoding:gzip, deflate, br
accept-language:en-GB,en;q=0.8
cache-control:no-cache
content-length:527
content-type:application/x-www-form-urlencoded
cookie: "Long cookie string"... csrftoken=NEBpBZXbmq0dqh9mBTfixxm4L9psWdrE;
origin:https://www.instagram.com
pragma:no-cache
referer:https://www.instagram.com/explore/tags/food/
user-agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36
x-csrftoken:NEBpBZXbmq0dqh9mBTfixxm4L9psWdrE
x-instagram-ajax:1
x-requested-with:XMLHttpRequest

Typically with a system like this, I’ll include all of the static headers a browser sends anyway, but here the important value comes from the x-csrftoken. CSRF tokens are used to protect against CSRF attacks, so you need to obtain one. Typically, I’ll make a HEAD request to Instagram’s homepage, extract the whole cookie string, and use the csrftoken value from that cookie string as the value for my x-csrftoken header.
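Here is a minimal sketch of that token step (the get_csrf_token helper is just illustrative; a plain GET is used here, though a HEAD request as described above works as well):

import requests

USER_AGENT = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36")

def get_csrf_token(session):
    # Hit the Instagram homepage and read the csrftoken cookie it sets
    response = session.get("https://www.instagram.com/",
                           headers={"User-Agent": USER_AGENT})
    response.raise_for_status()
    return response.cookies.get("csrftoken")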

So with all of that information, you can now go and make a POST request to Instagram for the next page in the results.

The rest of the process is then just repeating the same steps as above. The only notable difference is that, rather than having a parent ‘tag’ object that our ‘media’ object was enclosed in, all further POST requests will just return the ‘media’ object, so you don’t have to include ‘tag’ when accessing the JSON response.
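Putting the pieces together, here is a rough sketch of requesting the next page, reusing the build_query and get_csrf_token helpers sketched earlier (again, the names are illustrative, and the behaviour of the /query/ endpoint is as described at the time of writing):

def fetch_next_page(session, tag, end_cursor):
    csrf_token = get_csrf_token(session)
    headers = {
        "User-Agent": USER_AGENT,
        "X-CSRFToken": csrf_token,
        "X-Instagram-AJAX": "1",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.instagram.com/explore/tags/%s/" % tag,
    }
    data = {
        "q": build_query(tag, end_cursor),
        "ref": "tags::show",
        "query_id": "",
    }
    response = session.post("https://www.instagram.com/query/",
                            headers=headers, data=data)
    response.raise_for_status()
    # As noted above, the paging response has 'media' at the top level,
    # rather than nested inside a 'tag' object
    media = response.json()["media"]
    return media["nodes"], media["page_info"]

From there it is just a loop: feed each response’s end_cursor back in, and stop when has_next_page is false.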

I’ve written a quick Python example for the whole process, which can be found at https://github.com/tomkdickinson/Instagram-Search-API-Python.

In a subsequent post, I’ll show how to extract comments, likes, and a full public timeline.

23 Comments

  1. Hey Tom, how are you?

    Did you manage to use a similar post request to retrieve followers from a given user?
    I believe I’ve done pretty much the same as you (but with different parameters) with no success…

    Cheers

  2. Hi Diego,

    I’m very good thank you, hope you’re doing great too!

    I haven’t actually tried extracting those before, but having a look through at it, it uses the same request structure as extracting comments.

    I might have a crack at modifying my script to do it, and let you know if I’m successful.

    Off the top of my head, have you made sure you’re including a CSRF token header, with the CSRF token Instagram gives you? As well as supplying a UserAgent string that mimics a browser like Chrome?

    Tom

    • I think so!

      I managed to log in, since the followers list won’t be shown in anonymous browsing, and posted a request with the data structure observed in developer mode.
      The response code is fine (200) but no data is shown, only the number of nodes in the list.

      I came across your post by chance, since I refuse to believe that it might just be some sort of protection from Instagram to avoid 3rd party APIs.

      But well, it might just be the case..
      I am assuming a coding error and will use your script to avoid conceptual mistakes.
      I will let you know if any progress is made.

      Diego

      • Ok, got it to work.

        Here’s a Gist of the script.

        https://gist.github.com/tomkdickinson/a093d30523dd77ae970f3ffcf26e1344

        Just needs an instagram username/password to work, and replace “justintimberlake” with whomever you want to extract.

        Take a look at the ‘q’ param in get_following_params, and get_followed_by_params.

        If you were just getting nodes and no data, I suspect the query parameters weren’t quite right.

        • I’ve added the piece of code as a comment to your gist!

          what versions of python / requests / json are you using?

          Diego

          • Moved the conversation to the gist for now.

  3. LG

    Hi! I was wondering if you had any advice regarding scraping the JSON data for extracting comments on a post. I’ve got some code working which picks off the comments and the users who made them. But my issue is that not ALL of the comments are displayed in the JSON data due to the “load more comments” feature. This is written in the JSON using “has_next_page” / “end_cursor” like pagination. I know how to deal with the pagination of a profile’s main page i.e. https://www.instagram.com/name/?__a=1&max_id=[value from end_cursor]. However, when I try to use this method to page through the list of comments, it does not work. Perhaps there is a different query (not ‘?__a=1’)? Thanks!

    • So the pagination for the comments needs to be POSTed with the same /query url that the pagination for retrieving more posts uses. However, there are some differences in the parameters you post. I am planning on writing something about comment collection, but my immediate advice would be to open up something like dev tools in chrome, and look at the AJAX query that is sent when you click “load more comments”. You’ll see all the query parameters that need to be sent, as well as how you can modify it to smoothly scroll through the comments with the end_cursor parameter.

  4. Fletcher

    This is great, thanks for sharing! Helped me get going quickly 🙂

  5. Why do you use ‘__a’? It works, but how did you find it out? Why not ‘__b’, for example? Where did you get the information from? What initial information or inferences helped you to discover that? How can we tell whether an internal service/site makes it possible to get data directly through JSON? Which parameter should we use? Is ‘a’ a standard in the industry?
    Have fun!

    • The URL format isn’t really a standard in the industry (at least not one I’ve come across), but the way that a page may make an AJAX request and present that information is. Typically if you understand REST systems, or systems that don’t adhere to REST but use AJAX, you can reverse engineer someone else’s system.

      As for the URL, you find that information out by using something like Chrome dev tools, or some other network request analysing tool. That way you can see which URLs are loaded when certain events happen on a page.

      For example, when on Instagram you can press F12 in Chrome, and go to the Network tab. If you go to someone’s profile, you’ll see a request made to “screen_name/?__a=1”, and if you look at the response or preview of that request, you’ll see a JSON document with most of the information on the page.

      Typically, the rule of thumb I employ when writing scripts like this is to pretend my script is a browser. If you copy all the requests that your browser makes when accessing pages like that, you can typically make a non-optimised script that works, and then it’s just trial and error to see what each parameter does.

  6. peter_newick

    Hi Tom!

    I really like your concept, and the example script you provided could get some very useful data for a semester project of mine. However, I am new to Python and could not figure out how to go over the next pages, so I can still only get the posts from the first page. May I ask you for some hints on this, like how to tweak the example code? Thanks a lot!

  7. Séti

    Hi Tom,

    This string ‘__a=1’ was cool and helpful; the only thing I am checking now is that if I am not logged into Instagram, the link stops working. I feel it was working before even without being logged into Instagram, and now it has stopped!

    Let me know if there is something that I have missed!

    Cheers,
    Seti

    • Still working for me when logged out.

      Might be daft to ask, but you’re doing ?__a=1 correct?

      • Jasper Cashmore

        I’m having the same issue with the recent location feed (https://www.instagram.com/explore/locations/214991417/?__a=1). It works fine in my logged-in Instagram Chrome session, but pasting it into an incognito window returns “Page not available”.

        I’m currently in a long battle trying to work out how to login programmatically but it’s doing my head in.

        • Odd. It might be a geolocation thing. I’ve had someone else say they can’t access it in the past, and they weren’t based in the UK.

          I have written something in the past though that can help you log in. I’ll see if I can dig it out and share it in a new blog post.

    • Melanie Moy

      A second thanks for this ‘__a=1’ trick. You just saved my hackathon project!

  8. Joy Kab

    Hi Tom,

    I noticed that when I run the instagram_search I do not get any data. I did try browsing the URL, e.g. https://www.instagram.com/explore/tags/food, but it asks me to log in first. It worked months ago, but not today.

    Are you aware of any change in Instagram recently?

    Regards
    /Joy

  9. Rafael

    Hi
    Very useful, this script. I am a beginner programmer, trying to do this in Java, but I have difficulty understanding the POST query string for more results.

    I know how to get all the information to put in the string. But I don’t know how to format the string after the “/query/”

    “https://www.instagram.com/query/” _______

    Thanks

  10. Hi Tom! Thank you very much for your solution! I was wondering if you can give me a tip how to extract username if I get just userid. Thanks a lot in advance!

  11. Max

    Amazing job! I’m using your Python script for my university project.
    But I have a problem: after 14-20 successfully scraped Instagram posts there is an error:
    ERROR:root:Expecting value: line 1 column 1 (char 0)
    Does anybody know this issue? And a possible solution?

    • Hi Max,

      Sorry for the slow reply, I’ve been very busy recently.

      Instagram has updated their code, so the old solution was invalid.

      However, I’ve spent some time today fixing this, so the script should now extract hashtags again.

      Tom
